# Decision Trees

First we'll load some fake data on past hires I made up. Note how we use pandas to convert a csv file into a DataFrame:

In [1]:
import numpy as np
import pandas as pd
from sklearn import tree

input_file = "PastHires.csv"  # path to the .csv file
df = pd.read_csv(input_file, header=0)  # build a DataFrame from the file; row 0 holds the column names

In [2]:
df.head()  # the last column (Hired) holds the final answer

Unnamed: 0,Years Experience,Employed?,Previous employers,Level of Education,Top-tier school,Interned,Hired
0,10,Y,4,BS,N,N,Y
1,0,N,0,BS,Y,Y,Y
2,7,N,6,BS,N,N,N
3,2,Y,1,MS,Y,N,Y
4,20,N,2,PhD,Y,N,N


scikit-learn needs everything to be numerical for decision trees to work. So, we'll map Y,N to 1,0 and levels of education to some scale of 0-2. In the real world, you'd need to think about how to deal with unexpected or missing data! By using map(), we know we'll get NaN for unexpected values.
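To make the NaN behaviour concrete, here is a minimal sketch. The series `s` and its "maybe" value are made up for illustration and are not part of PastHires.csv; the point is only that anything missing from the mapping dictionary comes back as NaN, which can then be counted before training.

```python
import pandas as pd

# Hypothetical column with one unexpected value ("maybe") and one missing value.
s = pd.Series(["Y", "N", "maybe", None])

# Values that are not keys of the dictionary become NaN.
mapped = s.map({"Y": 1, "N": 0})
print(mapped)

# Count the rows that failed to map so they can be fixed or dropped.
n_bad = int(mapped.isna().sum())
print("unmapped rows:", n_bad)
```

Checking this count after each `map()` call is one simple way to notice bad input before it reaches the classifier.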

In [4]:
d = {'Y': 1, 'N': 0}  # maps the letter Y to 1 and N to 0
df['Hired'] = df['Hired'].map(d)  # replace Y/N with 1/0 throughout the Hired column
df['Employed?'] = df['Employed?'].map(d)  # same for the Employed? column
df['Top-tier school'] = df['Top-tier school'].map(d)  # same for this column
df['Interned'] = df['Interned'].map(d)  # and Y/N become 1/0 in Interned as well
d = {'BS': 0, 'MS': 1, 'PhD': 2}  # ordinal scale for education level
df['Level of Education'] = df['Level of Education'].map(d)
df.head()

Next we need to separate the features from the target column that we're trying to build a decision tree for.

In [None]:
features = list(df.columns[:6])  # take the first 6 column names as the feature list
features
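Slicing by position works as long as the target stays in the last column. As a sketch of an alternative, the feature list can be built by name instead. The one-row `df_demo` below is a placeholder that only borrows the column names from the `df.head()` output; `target` and `features_by_name` are names introduced here for illustration.

```python
import pandas as pd

# Column names copied from the df.head() output; the single row is a placeholder.
df_demo = pd.DataFrame(
    [[10, 1, 4, 0, 0, 0, 1]],
    columns=["Years Experience", "Employed?", "Previous employers",
             "Level of Education", "Top-tier school", "Interned", "Hired"],
)

target = "Hired"
# Every column except the target becomes a feature, wherever the target sits.
features_by_name = [c for c in df_demo.columns if c != target]
print(features_by_name)
```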

Now actually construct the decision tree:

In [None]:
y = df["Hired"]  # the Hired column (what we want to predict)
X = df[features]  # the feature columns only
clf = tree.DecisionTreeClassifier()  # create the classifier
clf = clf.fit(X, y)  # fit the classifier to the features and the answers

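... and display it. Note you need to have pydotplus installed for this to work (!pip install pydotplus). pydotplus in turn calls the Graphviz `dot` program to render the image, so Graphviz itself must be installed too.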

To read this decision tree, each condition branches left for "true" and right for "false". When you end up at a value, the value array shows how many samples fall into each target class. So value = [0. 5.] means there are 0 "no hires" and 5 "hires" by the time we get to that point, and value = [3. 0.] means 3 no-hires and 0 hires.

In [None]:
from IPython.display import Image
from io import StringIO
import pydotplus

dot_data = StringIO()  # string buffer that will hold the tree in DOT format
tree.export_graphviz(clf, out_file=dot_data,  # export the tree model (clf) to DOT
                     feature_names=features)  # label the nodes with the feature names
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  # build a graph from the DOT data
Image(graph.create_png())  # render the graph as a PNG and show it in the notebook
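If pydotplus or Graphviz is not available, the same tree can be printed as plain text with scikit-learn's `export_text`. The sketch below rebuilds only the five rows shown in the `df.head()` output, already mapped to numbers, so the tree it prints is not the one trained on the full file. `df_demo`, `clf_demo`, and `random_state=0` are choices made for this example.

```python
import pandas as pd
from sklearn import tree

cols = ["Years Experience", "Employed?", "Previous employers",
        "Level of Education", "Top-tier school", "Interned"]

# The five rows from df.head(), with Y/N and education already mapped to numbers.
rows = [
    [10, 1, 4, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1],
    [7, 0, 6, 0, 0, 0, 0],
    [2, 1, 1, 1, 1, 0, 1],
    [20, 0, 2, 2, 1, 0, 0],
]
df_demo = pd.DataFrame(rows, columns=cols + ["Hired"])

# Fixed seed so repeated runs build the same tree.
clf_demo = tree.DecisionTreeClassifier(random_state=0)
clf_demo.fit(df_demo[cols], df_demo["Hired"])

# Each indented block is one branch; leaves report the predicted class.
text_tree = tree.export_text(clf_demo, feature_names=cols)
print(text_tree)
```

In the text form, the `<=` branch of each condition corresponds to the left ("true") branch of the drawn tree.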

## Ensemble learning: using a random forest

We'll use a random forest of 10 decision trees to predict whether specific candidate profiles would be hired:

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X, y)  # fit on the feature columns and the target we're trying to predict

#Predict employment of an employed 10-year veteran
print (clf.predict([[10, 1, 4, 0, 0, 0]]))
#...and an unemployed 10-year veteran
print (clf.predict([[10, 0, 4, 0, 0, 0]]))
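A forest's answer is an aggregate over its trees, so it can help to look at how confident the ensemble is. The sketch below is an illustration on the five sample rows from `df.head()` only, not the full file. It uses `predict_proba` for the averaged class probabilities and queries each tree in `estimators_` separately; `forest`, `candidate`, and `random_state=0` are names and settings chosen for this example.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

cols = ["Years Experience", "Employed?", "Previous employers",
        "Level of Education", "Top-tier school", "Interned"]

# The five rows from df.head(), already mapped to numbers.
X_demo = pd.DataFrame(
    [[10, 1, 4, 0, 0, 0],
     [0, 0, 0, 0, 1, 1],
     [7, 0, 6, 0, 0, 0],
     [2, 1, 1, 1, 1, 0],
     [20, 0, 2, 2, 1, 0]],
    columns=cols,
)
y_demo = pd.Series([1, 1, 0, 1, 0], name="Hired")

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# An employed 10-year veteran, passed as a DataFrame with matching column names.
candidate = pd.DataFrame([[10, 1, 4, 0, 0, 0]], columns=cols)

# Class probabilities averaged over the 10 trees, in the order of forest.classes_.
proba = forest.predict_proba(candidate)[0]
print(dict(zip(forest.classes_, proba)))

# Each tree's own prediction, as a class index (identical to the 0/1 labels here).
votes = [int(est.predict(candidate.to_numpy())[0]) for est in forest.estimators_]
print("per-tree predictions:", votes)
```

With so few rows the individual trees can disagree from one seed to the next, which is why a fixed `random_state` is used here.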

## Activity

Modify the test data to create an alternate universe where I hire everyone I normally wouldn't have, and vice versa. Compare the resulting decision tree to the one from the original data.

In [11]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

X = [[10, 1, 4, 0, 0, 0], [10, 0, 4, 0, 0, 0]]  # Example feature rows
y = [1, 0]  # Target variable (1 = hire, 0 = don't hire)

# Flip the target variable (y)
y_reversed = [1 - yi for yi in y]  # Turns 0 into 1 and 1 into 0

# Train the model on the flipped y values
clf_reversed = RandomForestClassifier(n_estimators=10, random_state=42)
clf_reversed = clf_reversed.fit(X, y_reversed)

# Show one decision tree from the forest trained on the flipped labels
print("Decision tree from the reversed random forest:")
# Prints the first of the forest's trees
print(export_text(clf_reversed.estimators_[0], feature_names=["Feature1", "Feature2", "Feature3", "Feature4", "Feature5", "Feature6"]))

# Predict with the reversed model on the same two rows
print("\nPredictions for the test data:")
reversed_predictions = clf_reversed.predict([[10, 1, 4, 0, 0, 0], [10, 0, 4, 0, 0, 0]])

print("Reversed predictions:", reversed_predictions)


Decision tree from the reversed random forest:
|--- class: 1.0


Predictions for the test data:
Reversed predictions: [0 1]
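The cell above uses only two hand-written rows. A closer match to the activity is to flip the Hired column itself and train a single decision tree on each version, so the two trees can be read side by side. The sketch below does this on the five rows shown in `df.head()` as a stand-in for the full file; `original`, `flipped`, and `random_state=0` are names and settings chosen for this example.

```python
import pandas as pd
from sklearn import tree

cols = ["Years Experience", "Employed?", "Previous employers",
        "Level of Education", "Top-tier school", "Interned"]

# The five rows from df.head(), already mapped to numbers.
X_demo = pd.DataFrame(
    [[10, 1, 4, 0, 0, 0],
     [0, 0, 0, 0, 1, 1],
     [7, 0, 6, 0, 0, 0],
     [2, 1, 1, 1, 1, 0],
     [20, 0, 2, 2, 1, 0]],
    columns=cols,
)
y_demo = pd.Series([1, 1, 0, 1, 0], name="Hired")

# Alternate universe: every hire becomes a no-hire and vice versa.
y_flipped = 1 - y_demo

original = tree.DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
flipped = tree.DecisionTreeClassifier(random_state=0).fit(X_demo, y_flipped)

print("Original labels:")
print(tree.export_text(original, feature_names=cols))
print("Flipped labels:")
print(tree.export_text(flipped, feature_names=cols))

# On the training rows, the two trees give opposite answers.
opposite = bool((original.predict(X_demo) == 1 - flipped.predict(X_demo)).all())
print("opposite on training rows:", opposite)
```

Since flipping the labels does not change which rows belong together, one would expect the two printouts to show similar split conditions with the leaf classes swapped.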
