# Classification Models: Decision Tree

### Importing libraries and functions

Importing the libraries:

In [0]:
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, recall_score, accuracy_score, precision_score
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold

### Data exploration and preprocessing

Loading the data. This dataset describes attributes of glass samples. The goal is to correctly classify the type of glass (vehicle glass, building glass, etc.) from the refractive index and the percentage of several chemical elements present, such as potassium and calcium.

More information about the dataset: [UCI](https://archive.ics.uci.edu/ml/datasets/Glass+Identification)

In [0]:
df = pd.read_csv('https://raw.githubusercontent.com/intelligentagents/aprendizagem-supervisionada/master/data/glass.csv')

Describing the dataset:

In [0]:
# Exploring the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 11 columns):
id       214 non-null int64
 ri      214 non-null float64
 na      214 non-null float64
 mg      214 non-null float64
 al      214 non-null float64
 si      214 non-null float64
 k       214 non-null float64
 ca      214 non-null float64
 ba      214 non-null float64
 fe      214 non-null float64
 type    214 non-null int64
dtypes: float64(9), int64(2)
memory usage: 18.5 KB
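
The `df.info()` output above suggests that every column label except `id` starts with a space (for example `' ri'` rather than `'ri'`), which would make label-based access such as `df['type']` fail. A minimal sketch of one way to normalize the labels, using a made-up toy frame (`toy` and its values are assumptions for illustration, not part of the dataset):

```python
import pandas as pd

# Toy frame mimicking the header style seen in df.info() (note the leading spaces).
# The values are illustrative only.
toy = pd.DataFrame({"id": [1, 2], " ri": [1.52101, 1.51761], " type": [1, 1]})

# Strip surrounding whitespace from every column label.
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```

The same `columns.str.strip()` call could be applied to `df` right after loading, if label-based selection is needed later.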


In [0]:
# Viewing summary statistics for the numeric columns of the dataset
df.describe()

| | id | ri | na | mg | al | si | k | ca | ba | fe | type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 | 214.0 |
| mean | 107.5 | 1.518365 | 13.40785 | 2.684533 | 1.444907 | 72.650935 | 0.497056 | 8.956963 | 0.175047 | 0.057009 | 2.780374 |
| std | 61.920648 | 0.003037 | 0.816604 | 1.442408 | 0.49927 | 0.774546 | 0.652192 | 1.423153 | 0.497219 | 0.097439 | 2.103739 |
| min | 1.0 | 1.51115 | 10.73 | 0.0 | 0.29 | 69.81 | 0.0 | 5.43 | 0.0 | 0.0 | 1.0 |
| 25% | 54.25 | 1.516523 | 12.9075 | 2.115 | 1.19 | 72.28 | 0.1225 | 8.24 | 0.0 | 0.0 | 1.0 |
| 50% | 107.5 | 1.51768 | 13.3 | 3.48 | 1.36 | 72.79 | 0.555 | 8.6 | 0.0 | 0.0 | 2.0 |
| 75% | 160.75 | 1.519157 | 13.825 | 3.6 | 1.63 | 73.0875 | 0.61 | 9.1725 | 0.0 | 0.1 | 3.0 |
| max | 214.0 | 1.53393 | 17.38 | 4.49 | 3.5 | 75.41 | 6.21 | 16.19 | 3.15 | 0.51 | 7.0 |


Viewing the first rows of the dataset:

In [0]:
df.head(10)

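| | id | ri | na | mg | al | si | k | ca | ba | fe | type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.52101 | 13.64 | 4.49 | 1.1 | 71.78 | 0.06 | 8.75 | 0.0 | 0.0 | 1 |
| 1 | 2 | 1.51761 | 13.89 | 3.6 | 1.36 | 72.73 | 0.48 | 7.83 | 0.0 | 0.0 | 1 |
| 2 | 3 | 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0.0 | 0.0 | 1 |
| 3 | 4 | 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0.0 | 0.0 | 1 |
| 4 | 5 | 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0.0 | 0.0 | 1 |
| 5 | 6 | 1.51596 | 12.79 | 3.61 | 1.62 | 72.97 | 0.64 | 8.07 | 0.0 | 0.26 | 1 |
| 6 | 7 | 1.51743 | 13.3 | 3.6 | 1.14 | 73.09 | 0.58 | 8.17 | 0.0 | 0.0 | 1 |
| 7 | 8 | 1.51756 | 13.15 | 3.61 | 1.05 | 73.24 | 0.57 | 8.24 | 0.0 | 0.0 | 1 |
| 8 | 9 | 1.51918 | 14.04 | 3.58 | 1.37 | 72.08 | 0.56 | 8.3 | 0.0 | 0.0 | 1 |
| 9 | 10 | 1.51755 | 13.0 | 3.6 | 1.36 | 72.99 | 0.57 | 8.4 | 0.0 | 0.11 | 1 |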


Defining the independent variables (`X`, the first ten columns) and the dependent variable (`y`, the last column, `type`):

In [0]:
X = df.iloc[:, :10].values
y = df.iloc[:, -1].values
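
Note that `df.iloc[:, :10]` keeps the `id` column as a feature. If the rows are ordered by glass type (the first ten rows, all of type 1, are consistent with this but do not prove it), `id` could act as a proxy for the label and inflate the scores. The sketch below only illustrates the slicing difference on a made-up toy frame with the same 11-column layout; `toy`, `X_with_id`, `X_no_id`, and `y_toy` are hypothetical names, and the alternative slice is a suggestion rather than what the notebook does:

```python
import numpy as np
import pandas as pd

# Toy frame with the same 11-column layout as the glass dataset (values are made up).
cols = ["id", "ri", "na", "mg", "al", "si", "k", "ca", "ba", "fe", "type"]
toy = pd.DataFrame(np.arange(33).reshape(3, 11), columns=cols)

X_with_id = toy.iloc[:, :10].values  # same slice as the cell above: id + 9 attributes
X_no_id = toy.iloc[:, 1:10].values   # hypothetical alternative: the 9 attributes only
y_toy = toy.iloc[:, -1].values       # last column: type

print(X_with_id.shape, X_no_id.shape, y_toy.shape)
```

Re-running the evaluation with the second slice would show how much of the reported performance depends on `id`; the results shown in this notebook were produced with the first slice.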

### Model Training and Evaluation (1st Approach)

Defining the metrics to be used:

In [0]:
metrics = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
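
The `_macro` suffix means the metric is computed per class and then averaged with equal weight for every class, regardless of how many samples each class has. A small worked example on made-up labels (`y_true` and `y_pred` are illustrative, not taken from the glass data):

```python
from sklearn.metrics import precision_score

# Made-up labels for two classes.
y_true = [1, 1, 1, 1, 2, 2]
y_pred = [1, 1, 1, 2, 2, 2]

# Per-class precision: class 1 -> 3 of 3 predictions correct,
# class 2 -> 2 of 3 predictions correct.
per_class = precision_score(y_true, y_pred, average=None)

# Macro precision is the unweighted mean of the per-class values: (1 + 2/3) / 2.
macro = precision_score(y_true, y_pred, average="macro")

print(per_class, macro)
```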

Training the Decision Tree model using cross-validation with 10 folds:

In [0]:
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)

scores = cross_validate(classifier, X, y, cv=10, scoring=metrics)
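
When `cv` is an integer and the estimator is a classifier, `cross_validate` uses stratified folds without shuffling, so each fold is drawn in file order within each class. A minimal sketch of making the splitter explicit and shuffled, shown on synthetic data from `make_classification` rather than the glass dataset (`X_demo`, `y_demo`, `cv_demo`, and `demo_scores` are hypothetical names):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for the real data (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=9, n_informative=5, n_classes=3, random_state=0
)

# Explicit stratified splitter with shuffling and a fixed seed for reproducibility.
cv_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)

demo_scores = cross_validate(
    clf, X_demo, y_demo, cv=cv_demo, scoring=["accuracy", "f1_macro"]
)

print(demo_scores["test_accuracy"].mean())
```

Passing a splitter object instead of an integer keeps the fold construction visible in the code and makes the run repeatable.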



Viewing the results:

In [0]:
pd.DataFrame.from_dict(scores)

| | fit_time | score_time | test_accuracy | test_precision_macro | test_recall_macro | test_f1_macro |
|---|---|---|---|---|---|---|
| 0 | 0.003749 | 0.005019 | 0.608696 | 0.572727 | 0.472222 | 0.482194 |
| 1 | 0.001794 | 0.003189 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | 0.001664 | 0.003127 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | 0.001663 | 0.003273 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | 0.001704 | 0.003109 | 1.0 | 1.0 | 1.0 | 1.0 |
| 5 | 0.001649 | 0.003141 | 1.0 | 1.0 | 1.0 | 1.0 |
| 6 | 0.001684 | 0.003137 | 1.0 | 1.0 | 1.0 | 1.0 |
| 7 | 0.001632 | 0.00318 | 1.0 | 1.0 | 1.0 | 1.0 |
| 8 | 0.00163 | 0.003124 | 1.0 | 1.0 | 1.0 | 1.0 |
| 9 | 0.001511 | 0.003154 | 0.666667 | 0.764286 | 0.828571 | 0.73974 |


Displaying the mean of each metric:

In [0]:
pd.DataFrame.from_dict(scores).mean()

fit_time                0.001868
score_time              0.003346
test_accuracy           0.927536
test_precision_macro    0.933701
test_recall_macro       0.930079
test_f1_macro           0.922193
dtype: float64

### Model Training and Evaluation (2nd Approach)

Setting the number of folds to 5. Note that the data is also shuffled before being split:

In [0]:
kf = KFold(n_splits=5, shuffle= True)
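
With `shuffle=True` and no `random_state`, the folds differ on every run, so the metrics below are not exactly reproducible. A minimal sketch, on a made-up array, showing that fixing `random_state` yields identical folds across splitter instances (the seed value 42 and the names `kf_a`, `kf_b`, and `data` are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

# Made-up data: 10 samples, 2 features.
data = np.arange(20).reshape(10, 2)

# Two independent splitters configured with the same seed.
kf_a = KFold(n_splits=5, shuffle=True, random_state=42)
kf_b = KFold(n_splits=5, shuffle=True, random_state=42)

# Collect the test indices of every fold from each splitter.
folds_a = [test.tolist() for _, test in kf_a.split(data)]
folds_b = [test.tolist() for _, test in kf_b.split(data)]

print(folds_a == folds_b)
```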

Defining the model to be trained:

In [0]:
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)

Creating a dataframe to store the final metric results:

In [0]:
df_results = pd.DataFrame(columns=['iteration', 'accuracy', 'precision', 'recall', 'f-measure'], index=None)

Training the classifier using cross-validation:

In [0]:
itera = 0
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train the model on the training set at each iteration
    classifier.fit(X_train, y_train)

    # Predict on the test set with the model just trained
    y_pred = classifier.predict(X_test)

    # Store this iteration's test-set metrics in the results dataframe
    itera += 1
    df_results.loc[len(df_results), :] = [
        itera,
        accuracy_score(y_test, y_pred),
        precision_score(y_test, y_pred, average='macro'),
        recall_score(y_test, y_pred, average='macro'),
        f1_score(y_test, y_pred, average='macro'),
    ]
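
Because the loop above appends to an existing `df_results`, re-running the cell without re-creating the dataframe adds more rows to it. One possible alternative is to collect the rows in a list that is rebuilt on every run and create the dataframe once at the end. The sketch below uses synthetic data from `make_classification` rather than the glass dataset; `rows`, `results_demo`, and the other `_demo` names are hypothetical, and `zero_division=0` is an added choice to silence warnings when a class is missing from a fold:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for the real data (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=150, n_features=9, n_informative=5, n_classes=3, random_state=0
)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)

rows = []  # rebuilt from scratch every time this cell runs
for i, (train_idx, test_idx) in enumerate(kf_demo.split(X_demo), start=1):
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    pred = clf.predict(X_demo[test_idx])
    truth = y_demo[test_idx]
    rows.append({
        "iteration": i,
        "accuracy": accuracy_score(truth, pred),
        "precision": precision_score(truth, pred, average="macro", zero_division=0),
        "recall": recall_score(truth, pred, average="macro", zero_division=0),
        "f-measure": f1_score(truth, pred, average="macro", zero_division=0),
    })

# Build the dataframe once, after the loop.
results_demo = pd.DataFrame(rows)
print(results_demo)
```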
    

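Displaying the dataframe containing the metric values at each iteration. The output has ten rows, with iterations 1 to 5 appearing twice, which suggests the training cell was executed twice without re-creating `df_results`: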

In [0]:
df_results

| | iteration | accuracy | precision | recall | f-measure |
|---|---|---|---|---|---|
| 0 | 1 | 0.976744 | 0.986111 | 0.966667 | 0.974235 |
| 1 | 2 | 0.976744 | 0.96 | 0.933333 | 0.937778 |
| 2 | 3 | 0.976744 | 0.981481 | 0.990741 | 0.985434 |
| 3 | 4 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | 5 | 0.97619 | 0.958333 | 0.944444 | 0.942857 |
| 5 | 1 | 1.0 | 1.0 | 1.0 | 1.0 |
| 6 | 2 | 1.0 | 1.0 | 1.0 | 1.0 |
| 7 | 3 | 1.0 | 1.0 | 1.0 | 1.0 |
| 8 | 4 | 0.953488 | 0.944444 | 0.944444 | 0.933333 |
| 9 | 5 | 0.97619 | 0.989583 | 0.988889 | 0.988877 |


Displaying the mean of the metrics:

In [0]:
df_results.mean()

iteration    3.000000
accuracy     0.983610
precision    0.981995
recall       0.976852
f-measure    0.976251
dtype: float64
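
The mean of the `iteration` column carries no information, and a mean alone hides how much the folds vary. A minimal sketch of a summary that drops the counter and reports both mean and standard deviation, using a made-up results frame (`toy_results` and its values are illustrative, not the notebook's results):

```python
import pandas as pd

# Made-up per-fold results (illustrative values only).
toy_results = pd.DataFrame({
    "iteration": [1, 2, 3],
    "accuracy": [0.9, 1.0, 0.8],
    "f-measure": [0.85, 1.0, 0.75],
})

# Drop the fold counter, then compute mean and standard deviation per metric.
summary = toy_results.drop(columns="iteration").agg(["mean", "std"])

print(summary)
```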