<a href="https://colab.research.google.com/github/obraia/xgboost-tdc/blob/main/XGBoost_TDC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<header>
  <img src="https://upload.wikimedia.org/wikipedia/commons/c/c3/Python-logo-notext.svg" width="100" align="left" hspace="10px" vspace="0px"/>
  
  <h1><b>XGBoost</b> - TDC</h1>
  <h4>Bryan Diniz, Débora Oliveira, Isabela Fonseca and Thais Lorentz</h4>
</header>

<p>e<b>X</b>treme <b>G</b>radient <b>Boost</b>ing</p>

<h3><b>1 - Importing the required libraries</b></h3>

<p></p>

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
from sklearn import preprocessing
from xgboost import XGBRegressor
from google.colab import files
from io import BytesIO

print('Libraries imported!')

<h3><b>2 - Importing the dataset</b></h3>

<p>The .csv file can be uploaded from your local machine.</p>

In [68]:
arquivos = files.upload()
nomeArquivo = list(arquivos.keys())[0]

if not nomeArquivo.endswith('.csv'):
  raise ValueError('Invalid file type!')

arquivo = arquivos[nomeArquivo]

df = pd.read_csv(BytesIO(arquivo))

print('Dataset "{nome}" imported!'.format(nome=nomeArquivo))

Saving stars-types.csv to stars-types (5).csv
Dataset "stars-types.csv" imported!
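<p>Outside Colab, <code>files.upload()</code> is not available. The sketch below applies the same validate-then-parse step to raw bytes. The helper name <code>load_csv_bytes</code> and the two-row sample are hypothetical, introduced only for illustration.</p>

```python
from io import BytesIO

import pandas as pd


def load_csv_bytes(file_name: str, content: bytes) -> pd.DataFrame:
    """Parse CSV bytes into a DataFrame, rejecting names that do not end in .csv."""
    if not file_name.lower().endswith('.csv'):
        raise ValueError('Invalid file type!')
    return pd.read_csv(BytesIO(content))


# Made-up sample that reuses two column names from the dataset
sample = b"temperature,star_color\n3068,red\n3042,red\n"
sample_df = load_csv_bytes('stars-types.csv', sample)
print(sample_df.shape)
```

<p>Keeping the check and the parsing in one function makes the step easy to reuse with any byte source, such as a file opened in binary mode.</p>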


<h3><b>3 - Displaying the first 5 instances of the dataset</b></h3>

<p></p>

In [69]:
df.head()

Unnamed: 0,temperature,luminosity,radius,absolute_magnitude,star_type,star_color,spectral_class
0,3068,0.0024,0.17,16.12,0,red,M
1,3042,0.0005,0.1542,16.6,0,red,M
2,2600,0.0003,0.102,18.7,0,red,M
3,2800,0.0002,0.16,16.65,0,red,M
4,1939,0.000138,0.103,20.06,0,red,M


<h3><b>4 - Displaying information about the dataset attributes</b></h3>

<p></p>

In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   temperature         240 non-null    int64  
 1   luminosity          240 non-null    float64
 2   radius              240 non-null    float64
 3   absolute_magnitude  240 non-null    float64
 4   star_type           240 non-null    int64  
 5   star_color          240 non-null    object 
 6   spectral_class      240 non-null    object 
dtypes: float64(3), int64(2), object(2)
memory usage: 13.2+ KB


<h3><b>5 - Data preprocessing</b></h3>

<p>The dataset will be processed so that it is compatible with the XGBoost algorithm and yields the best possible results.</p>

<h4><b>5.1 - Handling missing values</b></h4>

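<p>The check below counts the missing values in each column. This dataset has none, so the strategies in 5.1.2 to 5.1.5 are alternatives shown for reference.</p>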

In [71]:
def print_valores_ausentes():
  print(df.isnull().sum())

print_valores_ausentes()

temperature           0
luminosity            0
radius                0
absolute_magnitude    0
star_type             0
star_color            0
spectral_class        0
dtype: int64


<h5><b>5.1.2 - Dropping instances with missing values</b></h5>

<p></p>

In [None]:
df.dropna(inplace=True)

print_valores_ausentes()

<h5><b>5.1.3 - Filling missing values with the mean</b></h5>

<p></p>

In [None]:
media_temperature = round(df.temperature.mean())

# assign the result back instead of relying on an in-place fill of a column view
df['temperature'] = df['temperature'].fillna(media_temperature)

<h5><b>5.1.4 - Filling missing values with the median</b></h5>

<p></p>

In [None]:
mediana_temperature = df.temperature.median()

# assign the result back instead of relying on an in-place fill of a column view
df['temperature'] = df['temperature'].fillna(mediana_temperature)

<h5><b>5.1.5 - Filling missing values with the mode (the most frequent value)</b></h5>

<p></p>

In [None]:
# mode() returns the most frequent label; value_counts()[0] would return its count
moda_star_color = df['star_color'].mode()[0]

df['star_color'] = df['star_color'].fillna(moda_star_color)
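<p>To see how the three fill strategies differ, the sketch below applies them to a small made-up frame. The values are illustrative and are not taken from the dataset.</p>

```python
import numpy as np
import pandas as pd

# Toy data: one missing temperature and one missing color (made-up values)
toy = pd.DataFrame({
    'temperature': [3000.0, 3200.0, np.nan, 25000.0],
    'star_color': ['red', 'red', None, 'blue'],
})

# The mean is pulled up by the hot outlier; the median is not
mean_fill = toy['temperature'].fillna(toy['temperature'].mean())
median_fill = toy['temperature'].fillna(toy['temperature'].median())

# The mode is the most frequent label and suits categorical columns
mode_fill = toy['star_color'].fillna(toy['star_color'].mode()[0])

print(mean_fill[2], median_fill[2], mode_fill[2])
```

<p>The median is usually the safer choice for skewed numeric attributes, while the mode is the only one of the three that applies to text columns.</p>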

<h4><b>5.2 - Converting categorical variables to numeric</b></h4>

<p>Because XGBoost, as used here, works only with numeric values, the next steps transform the attributes.</p>

<h5><b>5.2.1 - Converting categorical attributes to dummy (one-hot) attributes</b></h5>

<p>In this step we transform the <b>star_color</b> attribute using <b>symbolic-to-numeric conversion</b>: each color becomes its own 0/1 column.</p>

In [72]:
df_2 = pd.get_dummies(df, columns=['star_color'], drop_first=True)

df_2.head()

Unnamed: 0,temperature,luminosity,radius,absolute_magnitude,star_type,spectral_class,star_color_blue,star_color_blue.1,star_color_blue_white,star_color_orange_red,star_color_pale_yellow_orange,star_color_red,star_color_white,star_color_white_yellow,star_color_whitish,star_color_yellow_white,star_color_yellowish,star_color_yellowish_white
0,3068,0.0024,0.17,16.12,0,M,0,0,0,0,0,1,0,0,0,0,0,0
1,3042,0.0005,0.1542,16.6,0,M,0,0,0,0,0,1,0,0,0,0,0,0
2,2600,0.0003,0.102,18.7,0,M,0,0,0,0,0,1,0,0,0,0,0,0
3,2800,0.0002,0.16,16.65,0,M,0,0,0,0,0,1,0,0,0,0,0,0
4,1939,0.000138,0.103,20.06,0,M,0,0,0,0,0,1,0,0,0,0,0,0
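<p>The output above contains two columns that both display as <code>star_color_blue</code>, which suggests two label variants that differ only by something invisible such as trailing whitespace. That cause is an assumption and has not been checked against the file. Under that assumption, normalizing the labels before encoding could look like this sketch on made-up data:</p>

```python
import pandas as pd

# Made-up labels: 'blue ' carries a trailing space
toy = pd.DataFrame({'star_color': ['red', 'blue', 'blue ', 'white']})

# Without cleaning, 'blue' and 'blue ' become separate dummy columns
raw = pd.get_dummies(toy, columns=['star_color'])

# Strip whitespace and lowercase so equivalent labels collapse into one
clean = toy.assign(star_color=toy['star_color'].str.strip().str.lower())
cleaned = pd.get_dummies(clean, columns=['star_color'])

print(raw.shape[1], cleaned.shape[1])
```

<p>Inspecting <code>df['star_color'].unique()</code> first would show whether this kind of cleanup is needed.</p>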


<h5><b>5.2.1 - Converting the categorical class to numeric</b></h5>

<p>Unlike the previous step, our class can keep a numeric relationship without causing problems for the classification, so we use a simple conversion to satisfy the algorithm.</p>

In [73]:
df_3 = df_2.copy()

# convert the class column to a pandas category variable
df_3['spectral_class'] = df_3['spectral_class'].astype('category')

# convert each category to its integer code
df_3['spectral_class'] = df_3['spectral_class'].cat.codes

df_3.head()

Unnamed: 0,temperature,luminosity,radius,absolute_magnitude,star_type,spectral_class,star_color_blue,star_color_blue.1,star_color_blue_white,star_color_orange_red,star_color_pale_yellow_orange,star_color_red,star_color_white,star_color_white_yellow,star_color_whitish,star_color_yellow_white,star_color_yellowish,star_color_yellowish_white
0,3068,0.0024,0.17,16.12,0,5,0,0,0,0,0,1,0,0,0,0,0,0
1,3042,0.0005,0.1542,16.6,0,5,0,0,0,0,0,1,0,0,0,0,0,0
2,2600,0.0003,0.102,18.7,0,5,0,0,0,0,0,1,0,0,0,0,0,0
3,2800,0.0002,0.16,16.65,0,5,0,0,0,0,0,1,0,0,0,0,0,0
4,1939,0.000138,0.103,20.06,0,5,0,0,0,0,0,1,0,0,0,0,0,0
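<p>The integer codes are assigned in sorted label order, so it helps to keep the code-to-label mapping for interpreting predictions later. The sketch below assumes the seven main spectral classes as the label set, which the dataset does not confirm. With that set, <code>M</code> maps to 5, matching the output above.</p>

```python
import pandas as pd

# Assumed label set (hypothetical): the seven main spectral classes
labels = pd.Series(['M', 'O', 'B', 'A', 'F', 'G', 'K', 'M'])
as_category = labels.astype('category')

# Codes follow the sorted order of the categories
codes = as_category.cat.codes

# Mapping from integer code back to the original label
code_to_label = dict(enumerate(as_category.cat.categories))
print(code_to_label)
```

<p>Note that the sorted order is alphabetical, so the codes do not follow the physical temperature ordering of the spectral classes.</p>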


<h5><b>5.2.2 - Viewing the dataset after the transformations</b></h5>

<p></p>

In [74]:
df_3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 18 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   temperature                    240 non-null    int64  
 1   luminosity                     240 non-null    float64
 2   radius                         240 non-null    float64
 3   absolute_magnitude             240 non-null    float64
 4   star_type                      240 non-null    int64  
 5   spectral_class                 240 non-null    int8   
 6   star_color_blue                240 non-null    uint8  
 7   star_color_blue                240 non-null    uint8  
 8   star_color_blue_white          240 non-null    uint8  
 9   star_color_orange_red          240 non-null    uint8  
 10  star_color_pale_yellow_orange  240 non-null    uint8  
 11  star_color_red                 240 non-null    uint8  
 12  star_color_white               240 non-null    uin

<h5><b>5.2.3 - Splitting the dataset</b></h5>

<p>In this step the data are separated into the target class and the training and test sets.</p>

In [75]:
y = df_3['spectral_class']
x = df_3.drop(['spectral_class'], axis=1)

# split into training and test sets
train_x, test_x, train_y, test_y = train_test_split(x.values, y.values, test_size=0.33, random_state=7)
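<p>With 240 rows, <code>test_size=0.33</code> gives 80 test rows and 160 training rows. The sketch below checks this on synthetic data and also passes <code>stratify</code>, an optional extension that the original split does not use, to keep the class proportions similar in both sets.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 240 rows, 3 features, 6 balanced classes (made-up data)
demo_x = np.arange(240 * 3).reshape(240, 3)
demo_y = np.arange(240) % 6

# stratify=demo_y is an optional addition, not part of the original split
tr_x, te_x, tr_y, te_y = train_test_split(
    demo_x, demo_y, test_size=0.33, random_state=7, stratify=demo_y
)

print(tr_x.shape, te_x.shape)
```

<p>Stratification matters most when some classes have few instances, since a plain random split can leave them almost absent from the test set.</p>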

<h3><b>6 - Implementing XGBoost</b></h3>

<p></p>

<h4><b>6.1 - Training the model</b></h4>

<p></p>

In [76]:
# XGBRegressor predicts a continuous value for the class code
model = XGBRegressor()
 
model.fit(train_x, train_y, verbose=False)
 
# round each prediction to the nearest integer class code
y_pred = model.predict(test_x)
predictions = [round(value) for value in y_pred]

# scikit-learn metrics take (y_true, y_pred) in that order
absolute_error = mean_absolute_error(test_y, predictions)
accuracy = accuracy_score(test_y, predictions)
 
print("Mean absolute error: {:.2f}".format(absolute_error))
print("Accuracy: {:.2f}%".format(accuracy * 100.0))

Mean absolute error: 0.16
Accuracy: 86.25%
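<p>Because the model is a regressor, a rounded prediction can fall outside the range of valid class codes. The sketch below shows one way to guard against that by clipping before scoring. The arrays are made up and the class count of 7 is an assumption.</p>

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Made-up true class codes and raw regressor outputs
y_true = np.array([5, 5, 0, 1, 6, 3])
y_raw = np.array([5.2, 4.4, -0.6, 1.1, 6.7, 2.9])

n_classes = 7  # assumed number of class codes (0..6)

# Round to the nearest integer, then clip into the valid code range
y_round = np.clip(np.rint(y_raw), 0, n_classes - 1).astype(int)

mae = mean_absolute_error(y_true, y_round)
acc = accuracy_score(y_true, y_round)
print("MAE: {:.2f}  Accuracy: {:.2f}%".format(mae, acc * 100.0))
```

<p>A classifier such as <code>XGBClassifier</code> would avoid the rounding step altogether; the regressor is kept here to match the approach of this notebook.</p>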
