<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Hero-Stats-Transformer" data-toc-modified-id="Hero-Stats-Transformer-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Hero Stats Transformer</a></span></li><li><span><a href="#Partidas-Dota-2---Transfomador" data-toc-modified-id="Partidas-Dota-2---Transfomador-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Dota 2 Matches - Transformer</a></span></li><li><span><a href="#Join-entre-dataframes" data-toc-modified-id="Join-entre-dataframes-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Join between dataframes</a></span><ul class="toc-item"><li><span><a href="#Filtrando-game_mode-e-lobby_type" data-toc-modified-id="Filtrando-game_mode-e-lobby_type-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Filtering game_mode and lobby_type</a></span></li><li><span><a href="#Unificando-datastes" data-toc-modified-id="Unificando-datastes-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Merging datasets</a></span></li></ul></li><li><span><a href="#Salvando-dataset" data-toc-modified-id="Salvando-dataset-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Saving the dataset</a></span></li></ul></div>

---
# Transformer 2: Attach the stats of the 10 heroes to each of the two teams

**Goal of the experiment:** Attach each team's stats to the training dataset. The following features, each a sum of per-hero features, will be added:

- is_Melle (number of melee, i.e. short-range, heroes; the rest of the team is ranged)
- Roles (total count of each role per team)
- Number of heroes per primary attribute: str, int, agi
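
As a minimal sketch of what "sum of per-hero features" means, assume each hero is described by 0/1 flags (the values below are hypothetical, not taken from `hero_stats`). Summing the flags of the five heroes on a team yields the team-level counts:

```python
import pandas as pd

# Hypothetical one-hot stats for the five heroes of a single team
team = pd.DataFrame({
    "is_Melle":         [1, 0, 1, 0, 0],
    "primary_attr_str": [1, 0, 0, 0, 1],
    "primary_attr_agi": [0, 0, 1, 0, 0],
    "primary_attr_int": [0, 1, 0, 1, 0],
    "Support":          [0, 1, 0, 1, 0],
})

# Summing each column turns per-hero flags into team-level counts
team_features = team.sum().add_prefix("Radiant_")
print(team_features)
```

Because every hero has exactly one primary attribute, the three `primary_attr_*` counts of a team always add up to five.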


The *order of procedures* is as follows:

**Hero_stats file**
1. Load the hero_stats file
2. Select the desired columns
3. Apply the transformation to the hero_stats data

**Match files**
1. Search for all data files available in the 'raw_data' folder
2. Concatenate them into a single dataframe
3. Remove null values
4. Remove duplicate rows
5. Add one column per hero for each team

**Data merge**
1. Keep only ranked matches with game_mode = 'game_mode_all_draft'
2. Join the datasets
3. Sum the desired columns for each team
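
The filtering step of the data merge could look like the sketch below. It assumes that `lobby_type == 7` marks ranked matches and that `game_mode == 22` corresponds to 'game_mode_all_draft'; both codes are assumptions that should be checked against the data source, and the toy rows are hypothetical.

```python
import pandas as pd

# Assumed codes (verify against the data source before relying on them)
RANKED_LOBBY = 7      # assumption: ranked matchmaking
ALL_DRAFT_MODE = 22   # assumption: 'game_mode_all_draft'

# Hypothetical matches with the same column names as the match files
matches = pd.DataFrame({
    "match_id": [1, 2, 3, 4],
    "lobby_type": [7, 0, 7, 7],
    "game_mode": [22, 22, 22, 3],
})

# Keep only rows that satisfy both conditions
mask = (matches["lobby_type"] == RANKED_LOBBY) & (matches["game_mode"] == ALL_DRAFT_MODE)
ranked_all_draft = matches.loc[mask].reset_index(drop=True)
print(ranked_all_draft)
```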

**Loading libraries and ``hero_stats`` JSON**

In [1]:
data_path = '../data/raw_data'

import pandas as pd
import numpy as np
from datetime import datetime
import os

import sys
sys.path.append('../utils')
import transformer_utils

## Hero Stats Transformer

In [2]:
hero_status_columns =['hero_id', 'is_Melle', 'primary_attr_agi', 'primary_attr_int',
       'primary_attr_str', 'Nuker', 'Disabler', 'Initiator', 'Durable',
       'Support', 'Jungler', 'Carry', 'Pusher', 'Escape']

In [3]:
hero_stats_raw = pd.read_json(data_path+'/hero_stats.json')
hero_stats_df = transformer_utils.hero_stats_tranformer(hero_stats_raw)
hero_stats_df = hero_stats_df.loc[:, hero_status_columns]
hero_stats_df.head()

Unnamed: 0,hero_id,is_Melle,primary_attr_agi,primary_attr_int,primary_attr_str,Nuker,Disabler,Initiator,Durable,Support,Jungler,Carry,Pusher,Escape
0,1,1,1,0,0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
1,2,1,0,0,1,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0
2,3,0,0,1,0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
3,4,1,1,0,0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0
4,5,0,0,1,0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
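
The source of `transformer_utils.hero_stats_tranformer` is not shown in this notebook. As a rough sketch of how a table like the one above could be produced, the example below assumes the raw JSON has `id`, `attack_type`, `primary_attr` and `roles` fields. These field names, the function name `one_hot_hero_stats` and the sample values are all assumptions for illustration, not the helper's actual implementation.

```python
import pandas as pd

# Hypothetical raw records (field names are assumptions)
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "attack_type": ["Melee", "Melee", "Ranged"],
    "primary_attr": ["agi", "str", "int"],
    "roles": [["Carry", "Escape"], ["Initiator", "Durable"], ["Support", "Nuker"]],
})

def one_hot_hero_stats(raw):
    # Hypothetical helper: one row per hero with 0/1 flag columns
    out = pd.DataFrame({"hero_id": raw["id"]})
    out["is_Melle"] = (raw["attack_type"] == "Melee").astype(int)

    # One column per primary attribute, e.g. primary_attr_agi
    attr = pd.get_dummies(raw["primary_attr"], prefix="primary_attr").astype(int)

    # Join each role list into "A|B" and expand it into one column per role
    roles = raw["roles"].str.join("|").str.get_dummies(sep="|")

    return pd.concat([out, attr, roles], axis=1)

hero_flags = one_hot_hero_stats(raw)
print(hero_flags)
```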


---
## Dota 2 Matches - Transformer

**Searching for each .csv file in the 'raw_data' folder**

In [4]:
lst_df = []
for root, dirs, files in os.walk(data_path):
    for filename in files:
        # Only the extension is needed to decide whether to load the file
        _, file_extension = os.path.splitext(filename)
        if file_extension == '.csv':
            print('.csv file found:\n')
            print(filename)
            # os.path.join builds a path that works on any operating system
            file_path = os.path.join(root, filename)
            lst_df.append(pd.read_csv(file_path))

.csv file found:

21-04-24 16h14m31s.csv
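
The same search can also be written with `pathlib`. The sketch below is an alternative, not the approach used in this notebook; `load_csv_files` is a hypothetical helper, and the demo runs on a temporary folder with made-up file names so it does not depend on 'raw_data'.

```python
from pathlib import Path
import tempfile
import pandas as pd

def load_csv_files(folder):
    # rglob walks the folder recursively, like os.walk, but filters by pattern
    csv_paths = sorted(Path(folder).rglob("*.csv"))
    frames = [pd.read_csv(path) for path in csv_paths]
    return csv_paths, frames

# Self-contained demo on a temporary folder (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"match_id": [1, 2]}).to_csv(Path(tmp, "a.csv"), index=False)
    pd.DataFrame({"match_id": [3]}).to_csv(Path(tmp, "b.csv"), index=False)
    Path(tmp, "hero_stats.json").write_text("[]")  # non-csv file, skipped

    csv_paths, frames = load_csv_files(tmp)
    found_names = [path.name for path in csv_paths]
    total_rows = sum(len(frame) for frame in frames)

print(found_names, total_rows)
```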



**Drop NA**

In [5]:
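# ignore_index gives the concatenated dataframe a single 0..n-1 index
match_df = pd.concat(lst_df, ignore_index=True)

print('Dataframe shape:', match_df.shape)
print('Total nan: \n\n', match_df.isna().sum())

# Reset the index after dropping rows so later positional steps stay aligned
match_df = match_df.dropna().reset_index(drop=True)
print('\nDataframe shape:', match_df.shape)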

Dataframe shape: (46000, 9)
Total nan: 

 Unnamed: 0      0
match_id        0
radiant_win     0
avg_mmr         0
duration        0
lobby_type      0
game_mode       0
radiant_team    0
dire_team       0
dtype: int64

Dataframe shape: (46000, 9)


**Remove duplicated rows**

In [6]:
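# drop_duplicates returns a new dataframe, so the result is assigned back
match_df = match_df.drop_duplicates(subset=['match_id']).reset_index(drop=True)
print('\nDataframe shape:', match_df.shape)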


Dataframe shape: (46000, 9)
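
One detail worth keeping in mind for this step: `DataFrame.drop_duplicates` returns a new dataframe and leaves the original untouched unless the result is assigned (or `inplace=True` is passed). A small sketch with hypothetical rows:

```python
import pandas as pd

# Hypothetical matches where match_id 1 appears twice
df = pd.DataFrame({"match_id": [1, 1, 2], "duration": [100, 100, 200]})

# Without assignment the original dataframe is unchanged
df.drop_duplicates(subset=["match_id"])
rows_without_assignment = len(df)

# Assigning the result keeps the de-duplicated rows
deduped = df.drop_duplicates(subset=["match_id"]).reset_index(drop=True)
rows_after_assignment = len(deduped)

print(rows_without_assignment, rows_after_assignment)
```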


---
## Summarize hero stats
1. **``create_features_df``**: creates an empty dataframe to hold the summarized data;
2. **``agg_by_team``:** for each row, looks up the team's heroes in the hero stats dataframe and returns their aggregated stats;
3. **``populate_feature_df``**: adds the aggregated result to the final dataframe.

In [7]:
def create_features_df(match_df, hero_stats_df):
    # Creating dataframe for features
    hero_stats_feature_cols = hero_stats_df.columns[1:]

    feature_cols = list('Radiant_'+hero_stats_feature_cols) \
                    + list('Dire_'+hero_stats_feature_cols)


    zero_mtx = np.zeros([match_df.shape[0], len(feature_cols)])
    feature_df = pd.DataFrame(zero_mtx, columns=feature_cols)
    
    return feature_df

In [8]:
def agg_by_team(team_heros_id, hero_stats_df, agg_function='sum'):
    
    intermediate_df = hero_stats_df.query("hero_id in @team_heros_id") \
                                .loc[:, hero_stats_df.columns != 'hero_id']
    
    agg_row = intermediate_df.agg(agg_function)
    
    return agg_row

In [9]:
def populate_feature_df(match_df, feature_df, hero_stats_df, agg_function='sum'):
    # Fill feature_df row by row with the aggregated stats of each team

    for i in range(match_df.shape[0]):
        row = match_df.iloc[i, :]

        # Team columns hold comma-separated hero ids
        radiant_team = [int(hero_id) for hero_id in row['radiant_team'].split(",")]
        dire_team = [int(hero_id) for hero_id in row['dire_team'].split(",")]

        agg_row_radiant = agg_by_team(radiant_team, hero_stats_df, agg_function) \
                            .add_prefix('Radiant_')

        agg_row_dire = agg_by_team(dire_team, hero_stats_df, agg_function) \
                            .add_prefix('Dire_')

        feature_df.loc[i, agg_row_radiant.index] = agg_row_radiant
        feature_df.loc[i, agg_row_dire.index] = agg_row_dire

    return feature_df

In [10]:
feature_df = create_features_df(match_df, hero_stats_df)
feature_df = populate_feature_df(match_df, feature_df, hero_stats_df, agg_function='sum')
feature_df.head()

Unnamed: 0,Radiant_is_Melle,Radiant_primary_attr_agi,Radiant_primary_attr_int,Radiant_primary_attr_str,Radiant_Nuker,Radiant_Disabler,Radiant_Initiator,Radiant_Durable,Radiant_Support,Radiant_Jungler,...,Dire_primary_attr_str,Dire_Nuker,Dire_Disabler,Dire_Initiator,Dire_Durable,Dire_Support,Dire_Jungler,Dire_Carry,Dire_Pusher,Dire_Escape
0,2.0,1.0,3.0,1.0,3.0,5.0,2.0,2.0,2.0,2.0,...,2.0,1.0,3.0,2.0,3.0,2.0,1.0,3.0,1.0,2.0
1,3.0,2.0,1.0,2.0,5.0,4.0,1.0,3.0,3.0,0.0,...,3.0,3.0,4.0,3.0,4.0,1.0,0.0,4.0,0.0,3.0
2,3.0,2.0,1.0,2.0,3.0,4.0,3.0,2.0,1.0,1.0,...,1.0,2.0,4.0,1.0,1.0,2.0,0.0,3.0,0.0,3.0
3,2.0,3.0,1.0,1.0,3.0,3.0,2.0,1.0,2.0,0.0,...,2.0,2.0,4.0,2.0,2.0,1.0,0.0,3.0,1.0,1.0
4,3.0,2.0,1.0,2.0,3.0,3.0,3.0,1.0,1.0,1.0,...,2.0,3.0,4.0,1.0,2.0,3.0,0.0,4.0,1.0,1.0
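
The row-by-row loop is easy to follow but can be slow on tens of thousands of matches. A vectorized alternative (explode, merge, then groupby) is sketched below. It is a different technique from the one used in this notebook, `team_features` is a hypothetical helper, the toy data is made up, and the sketch assumes that `match_id` is unique and that each hero appears at most once per team.

```python
import pandas as pd

# Toy hero stats: one row per hero with 0/1 feature columns (hypothetical values)
hero_stats = pd.DataFrame({
    "hero_id": [1, 2, 3, 4],
    "is_Melle": [1, 1, 0, 0],
    "Carry": [1, 0, 0, 1],
})

# Toy matches: each team is a comma-separated string of hero ids
matches = pd.DataFrame({
    "match_id": [10, 11],
    "radiant_team": ["1,2", "3,4"],
    "dire_team": ["3,4", "1,3"],
})

def team_features(matches, hero_stats, team_col, prefix):
    # One row per (match, hero): split the id string and explode the list
    long_df = matches[["match_id", team_col]].copy()
    long_df["hero_id"] = long_df[team_col].str.split(",")
    long_df = long_df.explode("hero_id")
    long_df["hero_id"] = long_df["hero_id"].astype(int)

    # Attach each hero's stats, then sum them per match
    merged = long_df.merge(hero_stats, on="hero_id", how="left")
    summed = merged.drop(columns=[team_col, "hero_id"]).groupby("match_id").sum()
    return summed.add_prefix(prefix)

radiant = team_features(matches, hero_stats, "radiant_team", "Radiant_")
dire = team_features(matches, hero_stats, "dire_team", "Dire_")

# Both results are indexed by match_id, so join lines them up by match
team_df = radiant.join(dire).reset_index()
print(team_df)
```

Since the result is keyed by `match_id`, it can be merged back onto the match table without relying on row order.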


## Include remaining data


In [11]:
cols_to_include = ['match_id', 'radiant_win', 'avg_mmr',
                   'duration', 'lobby_type', 'game_mode']

feature_df[cols_to_include] = match_df[cols_to_include]

In [12]:
feature_df.head()

Unnamed: 0,Radiant_is_Melle,Radiant_primary_attr_agi,Radiant_primary_attr_int,Radiant_primary_attr_str,Radiant_Nuker,Radiant_Disabler,Radiant_Initiator,Radiant_Durable,Radiant_Support,Radiant_Jungler,...,Dire_Jungler,Dire_Carry,Dire_Pusher,Dire_Escape,match_id,radiant_win,avg_mmr,duration,lobby_type,game_mode
0,2.0,1.0,3.0,1.0,3.0,5.0,2.0,2.0,2.0,2.0,...,1.0,3.0,1.0,2.0,5963069008,True,3439,1649,7,22
1,3.0,2.0,1.0,2.0,5.0,4.0,1.0,3.0,3.0,0.0,...,0.0,4.0,0.0,3.0,5963069009,True,3774,1848,0,22
2,3.0,2.0,1.0,2.0,3.0,4.0,3.0,2.0,1.0,1.0,...,0.0,3.0,0.0,3.0,5963069012,True,3311,1951,7,22
3,2.0,3.0,1.0,1.0,3.0,3.0,2.0,1.0,2.0,0.0,...,0.0,3.0,1.0,1.0,5963069204,True,3408,995,0,22
4,3.0,2.0,1.0,2.0,3.0,3.0,3.0,1.0,1.0,1.0,...,0.0,4.0,1.0,1.0,5963069208,False,4621,1818,7,22
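
Assigning columns from one dataframe to another aligns rows on index labels rather than on position, so this step relies on both dataframes sharing the same 0..n-1 index. A minimal sketch with hypothetical data shows what happens when the indexes differ:

```python
import pandas as pd

features = pd.DataFrame({"Radiant_Carry": [3.0, 4.0, 1.0]})  # index 0, 1, 2

# Hypothetical match table whose index no longer starts at 0
matches = pd.DataFrame({"match_id": [10, 11, 12]}, index=[1, 2, 3])

# Column assignment aligns on index labels, not on row position
aligned = features.copy()
aligned["match_id"] = matches["match_id"]

# Resetting the index first restores a positional match
positional = features.copy()
positional["match_id"] = matches["match_id"].reset_index(drop=True)

print(aligned)
print(positional)
```

In the first case the row labelled 0 has no partner and receives NaN, while the remaining ids shift by one row.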


**Converting the boolean target and the remaining columns to numeric**

In [13]:
feature_df['radiant_win'] = feature_df['radiant_win'].astype(int)
feature_df = feature_df.apply(pd.to_numeric)
feature_df

Unnamed: 0,Radiant_is_Melle,Radiant_primary_attr_agi,Radiant_primary_attr_int,Radiant_primary_attr_str,Radiant_Nuker,Radiant_Disabler,Radiant_Initiator,Radiant_Durable,Radiant_Support,Radiant_Jungler,...,Dire_Jungler,Dire_Carry,Dire_Pusher,Dire_Escape,match_id,radiant_win,avg_mmr,duration,lobby_type,game_mode
0,2.0,1.0,3.0,1.0,3.0,5.0,2.0,2.0,2.0,2.0,...,1.0,3.0,1.0,2.0,5963069008,1,3439,1649,7,22
1,3.0,2.0,1.0,2.0,5.0,4.0,1.0,3.0,3.0,0.0,...,0.0,4.0,0.0,3.0,5963069009,1,3774,1848,0,22
2,3.0,2.0,1.0,2.0,3.0,4.0,3.0,2.0,1.0,1.0,...,0.0,3.0,0.0,3.0,5963069012,1,3311,1951,7,22
3,2.0,3.0,1.0,1.0,3.0,3.0,2.0,1.0,2.0,0.0,...,0.0,3.0,1.0,1.0,5963069204,1,3408,995,0,22
4,3.0,2.0,1.0,2.0,3.0,3.0,3.0,1.0,1.0,1.0,...,0.0,4.0,1.0,1.0,5963069208,0,4621,1818,7,22
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45995,2.0,3.0,2.0,0.0,3.0,3.0,2.0,1.0,1.0,2.0,...,0.0,4.0,2.0,0.0,5960541818,0,3414,2507,7,22
45996,2.0,1.0,3.0,1.0,3.0,3.0,1.0,2.0,3.0,1.0,...,0.0,3.0,2.0,1.0,5960541916,1,3349,2156,7,22
45997,1.0,1.0,3.0,1.0,5.0,3.0,2.0,1.0,2.0,0.0,...,0.0,4.0,0.0,3.0,5960542119,0,3448,2534,7,22
45998,3.0,3.0,0.0,2.0,5.0,3.0,2.0,2.0,1.0,0.0,...,0.0,1.0,1.0,2.0,5960542310,1,3702,1650,7,3


## Saving the dataset
**Saving the dataframe to the 'working_data' folder**

In [14]:
working_data_path = '../data/working_data/2_TRA_v2_'
start_file = datetime.now().strftime("%Y-%m-%d")
output_file = working_data_path + start_file + '_working_data.csv'

feature_df.to_csv(output_file, index=False)
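
`to_csv` does not create missing folders, so the save fails if 'working_data' does not exist yet. The sketch below wraps the same naming scheme in a hypothetical helper, `save_working_data`, that creates the folder first; the demo writes to a temporary directory with made-up rows.

```python
import os
import tempfile
from datetime import datetime
import pandas as pd

def save_working_data(df, folder, prefix="2_TRA_v2_"):
    # Create the target folder if it is missing; to_csv does not create it
    os.makedirs(folder, exist_ok=True)
    date_tag = datetime.now().strftime("%Y-%m-%d")
    output_file = os.path.join(folder, prefix + date_tag + "_working_data.csv")
    df.to_csv(output_file, index=False)
    return output_file

# Self-contained demo in a temporary directory (hypothetical data)
with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame({"match_id": [1, 2], "radiant_win": [1, 0]})
    path = save_working_data(demo, os.path.join(tmp, "working_data"))
    reloaded = pd.read_csv(path)

print(os.path.basename(path), reloaded.shape)
```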