### To start preprocessing the data, we import the modules we will use

- pandas is used to manipulate the dataset
- Scikit-Learn is used to standardize the data
- A local module is used to encode the labels (words -> numbers)

In [15]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from label_encoder import LabelEncoder

### Let's define a few global constants that will help us later on

In [16]:
K = 1000
M = 1000000

### Let's read the CSV file containing our data and remove every row that has a missing value

In [17]:
df = pd.read_csv('../dataset.csv')
df = df.dropna()

### We remove the rows whose target value is "Unranked", since we are not interested in using it

In [18]:
df = df[df.elo != 'Unranked']

### We define two functions that clean up our dataset; they were written by hand because they are specific to our data


#### A brief explanation of each one:
- convert_to_float: Our dataset stores abbreviated values: instead of 1000 (one thousand) it contains 1k. Since we want numeric values, this function converts every string to a number, multiplying values that contain `k` by 1,000 and values that contain `m` by 1,000,000


- remove_excess: The target column carried extra information that we did not want to predict because of the small amount of data. Examples of the original values: Bronze I, Bronze II, Silver III, Diamond I, etc. To solve this, the function keeps only the elo name, that is, Bronze, Silver, Diamond, etc.

In [19]:
def convert_to_float(column):
    """Convert abbreviated strings such as '1.5k' or '2m' to floats."""
    values = []
    for line in column:
        text = str(line).lower()
        if 'm' in text:
            # 'm' marks millions
            values.append(float(text.replace('m', '')) * M)
        elif 'k' in text:
            # 'k' marks thousands
            values.append(float(text.replace('k', '')) * K)
        else:
            values.append(float(line))

    return values

ELOS = ['Bronze', 'Silver', 'Gold', 'Platinum', 'Diamond', 'Master', 'Challenger']

def remove_excess(column):
    """Reduce each elo to its base name, e.g. 'Bronze II' -> 'Bronze'."""
    values = []
    for idx, line in column.iterrows():
        text = str(list(line)[0]).lower()
        for elo in ELOS:
            # The base name may appear at the start or at the end of the value
            if text.startswith(elo.lower()) or text.endswith(elo.lower()):
                values.append(elo)
                break
        # Values that match no known elo are skipped, so the returned list
        # can be shorter than the input column

    return values
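
To make the two ideas concrete, here is a minimal standalone sketch that works on plain Python lists. The names `parse_abbreviated`, `simplify_elo`, `SUFFIXES` and `KNOWN_ELOS` are hypothetical and not part of this notebook, and the sample values are invented. Unlike `convert_to_float`, the sketch only inspects the last character of each value, and unlike `remove_excess` it returns `None` for unknown values instead of skipping them.

```python
# Hypothetical standalone helpers; names and sample values are illustrative only.
SUFFIXES = {'k': 1_000, 'm': 1_000_000}
KNOWN_ELOS = ('Bronze', 'Silver', 'Gold', 'Platinum',
              'Diamond', 'Master', 'Challenger')


def parse_abbreviated(value):
    """Turn '1k' into 1000.0 and '2.5m' into 2500000.0; plain numbers pass through."""
    text = str(value).strip().lower()
    if text and text[-1] in SUFFIXES:
        # Strip the suffix and scale by the matching multiplier
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


def simplify_elo(value):
    """Return the known elo found at the start or end of value, or None."""
    text = str(value).lower()
    for elo in KNOWN_ELOS:
        if text.startswith(elo.lower()) or text.endswith(elo.lower()):
            return elo
    return None


numbers = [parse_abbreviated(v) for v in ['1k', '2.5m', '42']]
elos = [simplify_elo(v) for v in ['Bronze II', 'Diamond I', 'Unknown']]
print(numbers)
print(elos)
```

Returning `None` keeps the output the same length as the input, which makes unmatched rows easy to spot before they are combined with other columns.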

### Let's build a dataset with all the numeric data and without the target column, so that we can standardize it

In [20]:
df_without_champions = df.drop(['elo', 'champion_1', 'champion_2', 'champion_3'], axis=1)
df_without_champions_columns = df_without_champions.columns
df_without_champions = df_without_champions.apply(convert_to_float)

### We use a StandardScaler object to rescale our numeric data to a standardized scale (zero mean and unit variance per column)

In [21]:
scale = StandardScaler()
scaled_df = pd.DataFrame(scale.fit_transform(df_without_champions), columns=list(df_without_champions_columns))
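
As a rough illustration of what standardization does to a single column, the sketch below applies z = (x - mean) / std using only the standard library. With its default settings, StandardScaler divides by the population standard deviation, which is what `statistics.pstdev` computes. The function name `standardize` and the sample values are assumptions made for this example.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


games = [10.0, 20.0, 30.0, 40.0]  # invented sample column
scaled = standardize(games)
print(scaled)
```

After the transformation the column has mean 0 and standard deviation 1, so features measured on very different scales become comparable.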

### We transform the categorical columns using one-hot encoding

#### Some notes:
1. Scikit-Learn's OneHotEncoder class can only handle numeric data, so the strings must first be converted to numbers with the LabelEncoder (this restriction applies to Scikit-Learn versions before 0.20; later versions accept strings directly)


2. The LabelEncoder from sklearn was producing inconsistencies, so we implemented our own version of the class


3. Each column was converted to one-hot separately, rather than all at once, to preserve the organization of the headers

In [22]:
one_hot_encoder = OneHotEncoder(categories='auto')
label_encoder = LabelEncoder()

champion_3 = df[['champion_3']]
champion_2 = df[['champion_2']]
champion_1 = df[['champion_1']]

champion_3 = pd.DataFrame(label_encoder.fit_transform(champion_3))
champion_2 = pd.DataFrame(label_encoder.fit_transform(champion_2))
champion_1 = pd.DataFrame(label_encoder.fit_transform(champion_1))

champion_3 = pd.DataFrame(one_hot_encoder.fit_transform(champion_3).toarray())
size_3 = len(champion_3.columns)

champion_2 = pd.DataFrame(one_hot_encoder.fit_transform(champion_2).toarray())
size_2 = len(champion_2.columns)
champion_2.columns = list(range(size_3, size_3 + size_2))

champion_1 = pd.DataFrame(one_hot_encoder.fit_transform(champion_1).toarray())
size_1 = len(champion_1.columns)
champion_1.columns = list(range(size_3 + size_2, size_3 + size_2 + size_1))
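
The cell above can be summarized by the following sketch, which uses plain Python instead of Scikit-Learn: each categorical column is mapped to integer codes, expanded into indicator columns, and its headers are numbered starting where the previous block ended so that they never collide. The function name `one_hot` and the champion names are illustrative assumptions, and sorting the categories alphabetically is a choice made here that need not match the local LabelEncoder.

```python
def one_hot(column, offset=0):
    """One-hot encode a list of strings.

    Returns (rows, headers), where headers are consecutive integers
    starting at offset, one per distinct category.
    """
    categories = sorted(set(column))
    code = {name: i for i, name in enumerate(categories)}  # label encoding
    rows = []
    for name in column:
        row = [0.0] * len(categories)
        row[code[name]] = 1.0  # mark the column of this category
        rows.append(row)
    headers = list(range(offset, offset + len(categories)))
    return rows, headers


champion_3 = ['Ahri', 'Zed', 'Ahri']   # invented sample values
champion_2 = ['Lux', 'Lux', 'Garen']

rows_3, headers_3 = one_hot(champion_3)
# Start numbering where the previous block stopped
rows_2, headers_2 = one_hot(champion_2, offset=len(headers_3))
print(headers_3, headers_2)
```

Numbering the headers with an offset is what allows the separate one-hot blocks to be concatenated side by side later without duplicate column names.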

### Build a dataset containing only the cleaned target column, with the simplified elos

In [23]:
target_column = pd.DataFrame(remove_excess(df[['elo']]), columns=['elo'])

### Concatenate the target-column dataset with the standardized dataset and the one-hot datasets

In [24]:
df = pd.concat([target_column, scaled_df, champion_3, champion_2, champion_1], axis=1).dropna()
df.head(50)

,elo,games,remakes,playing_time,kills,deaths,assists,gold,pentakills,wards,...,417,418,419,420,421,422,423,424,425,426
0,Bronze,0.804901,3.105339,-0.424502,1.012345,1.53094,0.721662,0.785667,-0.17599,0.618892,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Bronze,-0.433867,0.37607,-0.33632,-0.405432,-0.451647,-0.447535,-0.429137,-0.17599,-0.419391,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Bronze,-0.37625,-0.013825,-0.081573,-0.341197,-0.342468,-0.342885,-0.366599,-0.17599,-0.322206,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bronze,0.142304,0.37607,-0.522481,-0.235666,0.312604,0.422146,0.003549,-0.17599,0.076377,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Bronze,-0.505888,-0.403721,0.90802,-0.476551,-0.506237,-0.525121,-0.50705,-0.17599,-0.472289,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Bronze,-0.520293,-0.403721,-0.620461,-0.478845,-0.513681,-0.525121,-0.511219,-0.17599,-0.484591,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Bronze,-0.520293,-0.403721,-0.620461,-0.478845,-0.513681,-0.525121,-0.511219,-0.17599,-0.484591,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Bronze,-0.520293,-0.403721,-0.620461,-0.478845,-0.513681,-0.525121,-0.511219,-0.17599,-0.484591,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Gold,0.271943,1.545757,-0.502885,-0.098018,0.076877,0.990506,0.190903,-0.17599,1.23768,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Silver,-0.520293,-0.403721,-0.620461,-0.478845,-0.513681,-0.525121,-0.511219,-0.17599,-0.484591,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
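
The `.dropna()` at the end of the concatenation matters because `pd.concat(..., axis=1)` aligns rows by index label rather than by position: if one frame has fewer rows, the missing cells become NaN. A minimal sketch with invented values (the frame and column contents are assumptions for illustration):

```python
import pandas as pd

target = pd.DataFrame({'elo': ['Bronze', 'Gold']})       # 2 rows, index 0..1
features = pd.DataFrame({'games': [0.8, -0.4, 0.1]})     # 3 rows, index 0..2

# axis=1 places the frames side by side, aligning rows on the index
combined = pd.concat([target, features], axis=1)
print(combined)

# Row 2 has no elo, so dropna() removes it
cleaned = combined.dropna()
print(len(combined), len(cleaned))
```

Because alignment is by index, the frames should share the same index before they are concatenated. In this notebook each frame appears to be rebuilt from an array or a list, so each one gets a fresh 0..n-1 index.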


### Save only the data to the file "clean_dataset.csv", leaving out the row index

In [25]:
df.to_csv('../clean_dataset.csv', index=False)