# Decision Tree Classifiers

## Introduction

In presentation... 

## Decision trees with scikit-learn

Let's create a decision tree using the popular Iris dataset.

In [11]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Create the decision tree model
clf = DecisionTreeClassifier()

# Train the model
clf = clf.fit(X_train, y_train)

Now, let's show the resulting tree:

In [13]:
print(iris.target_names)
tree_rules = export_text(clf, feature_names=iris.feature_names)
print(tree_rules)

['setosa' 'versicolor' 'virginica']
|--- petal length (cm) <= 2.60
|   |--- class: 0
|--- petal length (cm) >  2.60
|   |--- petal width (cm) <= 1.65
|   |   |--- petal length (cm) <= 5.00
|   |   |   |--- class: 1
|   |   |--- petal length (cm) >  5.00
|   |   |   |--- sepal length (cm) <= 6.05
|   |   |   |   |--- class: 1
|   |   |   |--- sepal length (cm) >  6.05
|   |   |   |   |--- class: 2
|   |--- petal width (cm) >  1.65
|   |   |--- petal length (cm) <= 4.85
|   |   |   |--- sepal width (cm) <= 3.10
|   |   |   |   |--- class: 2
|   |   |   |--- sepal width (cm) >  3.10
|   |   |   |   |--- class: 1
|   |   |--- petal length (cm) >  4.85
|   |   |   |--- class: 2
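The text dump shows numeric class labels rather than species names. As a minimal sketch, the block below rebuilds the same split, reports the size of the fitted tree, and maps each label back to its name. The `random_state=0` on the classifier is an assumption added here for repeatability; the cell above does not set it, so the exact tree may differ from the one printed.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1
)

# random_state=0 is an assumption for repeatability, not part of the original cell
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Size of the fitted tree
depth = clf.get_depth()
n_leaves = clf.get_n_leaves()
print("Depth:", depth, "Leaves:", n_leaves)

# Map the numeric labels used in the text dump to species names
class_names = {int(label): str(iris.target_names[label]) for label in clf.classes_}
for label, name in class_names.items():
    print(f"class {label} -> {name}")
```

Reading the rules with this mapping in hand, `class: 0` is setosa, `class: 1` is versicolor, and `class: 2` is virginica.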



Next, we calculate its accuracy on the test set:

In [3]:
# Predict the test set
y_pred = clf.predict(X_test)

# Compute the model's accuracy
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9555555555555556
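Accuracy is simply the fraction of test samples whose predicted class matches the true class. The following sketch recomputes it by hand and from the confusion matrix to make that explicit. It rebuilds the model so that it runs on its own; `random_state=0` on the classifier is an assumption added for repeatability, so the value it prints may not match the one above exactly.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1
)

# random_state=0 is an assumption for repeatability
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy by hand: share of predictions that equal the true label
manual_accuracy = float(np.mean(y_pred == y_test))

# The same number from scikit-learn
library_accuracy = metrics.accuracy_score(y_test, y_pred)

# Confusion matrix: rows are true classes, columns are predicted classes;
# correct predictions sit on the diagonal
cm = metrics.confusion_matrix(y_test, y_pred)
diagonal_accuracy = cm.trace() / cm.sum()

print("Manual accuracy:  ", manual_accuracy)
print("Library accuracy: ", library_accuracy)
print(cm)
```

The confusion matrix also shows which classes are confused with each other, which a single accuracy number hides.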


### Limiting the minimum number of samples per leaf

In [15]:
# Create the decision tree model
clf = DecisionTreeClassifier(min_samples_leaf=5)

# Train the model
clf = clf.fit(X_train, y_train)

tree_rules = export_text(clf, feature_names=iris.feature_names)
print(tree_rules)

|--- petal length (cm) <= 2.60
|   |--- class: 0
|--- petal length (cm) >  2.60
|   |--- petal width (cm) <= 1.65
|   |   |--- petal length (cm) <= 4.85
|   |   |   |--- class: 1
|   |   |--- petal length (cm) >  4.85
|   |   |   |--- class: 1
|   |--- petal width (cm) >  1.65
|   |   |--- sepal length (cm) <= 5.95
|   |   |   |--- class: 2
|   |   |--- sepal length (cm) >  5.95
|   |   |   |--- class: 2



In [16]:
# Predict the test set
y_pred = clf.predict(X_test)

# Compute the model's accuracy
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9555555555555556


This tree is simpler, but just as accurate on the test set as the original one:
- The removed branches contain information specific to the training set, so dropping them does not hurt predictions on unseen data.
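One way to check this claim is to compare training and test accuracy for both trees. The sketch below does so and also verifies that every leaf of the constrained tree holds at least five training samples. The variable names and `random_state=0` on the classifiers are assumptions introduced for this illustration.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1
)

# random_state=0 is an assumption for repeatability
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(min_samples_leaf=5, random_state=0).fit(X_train, y_train)

scores = {}
for name, model in [("unconstrained", full), ("min_samples_leaf=5", pruned)]:
    train_acc = metrics.accuracy_score(y_train, model.predict(X_train))
    test_acc = metrics.accuracy_score(y_test, model.predict(X_test))
    scores[name] = (train_acc, test_acc)
    print(f"{name}: leaves={model.get_n_leaves()}, "
          f"train={train_acc:.3f}, test={test_acc:.3f}")

# Leaves are the nodes without children (children_left == -1)
is_leaf = pruned.tree_.children_left == -1
min_leaf_size = int(pruned.tree_.n_node_samples[is_leaf].min())
print("Smallest leaf in the constrained tree:", min_leaf_size)
```

A gap between training and test accuracy for the unconstrained tree, with a smaller gap for the constrained one, would be consistent with the extra branches fitting details of the training set.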

### Limiting the tree depth

In [29]:
# Create the decision tree model
clf = DecisionTreeClassifier(max_depth=2)

# Train the model
clf = clf.fit(X_train, y_train)

tree_rules = export_text(clf, feature_names=iris.feature_names)
print(tree_rules)

|--- petal width (cm) <= 0.80
|   |--- class: 0
|--- petal width (cm) >  0.80
|   |--- petal width (cm) <= 1.65
|   |   |--- class: 1
|   |--- petal width (cm) >  1.65
|   |   |--- class: 2



In [30]:
# Predict the test set
y_pred = clf.predict(X_test)

# Compute the model's accuracy
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9555555555555556


The tree is even simpler, and it is as accurate as before. But simplification has its limits ...

In [32]:
# Create the decision tree model
clf = DecisionTreeClassifier(max_depth=1)

# Train the model
clf = clf.fit(X_train, y_train)

tree_rules = export_text(clf, feature_names=iris.feature_names)
print(tree_rules)

|--- petal length (cm) <= 2.60
|   |--- class: 0
|--- petal length (cm) >  2.60
|   |--- class: 2



In [34]:
# Predict the test set
y_pred = clf.predict(X_test)

# Compute the model's accuracy
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.6


In [35]:
y_pred = clf.predict(X_train)
accuracy = metrics.accuracy_score(y_train, y_pred)
print("Accuracy on Training:", accuracy)

Accuracy on Training: 0.6952380952380952


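Now we have a clear example of **underfitting**: with a single split the tree can only predict two of the three classes, so accuracy is low on both the training set (about 0.70) and the test set (0.60).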
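To see where underfitting ends, it can help to sweep `max_depth` and record training and test accuracy at each value. The following is a minimal sketch; the depth range of 1 to 5, the `results` dictionary, and `random_state=0` on the classifier are assumptions chosen for illustration.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1
)

results = {}
for depth in range(1, 6):
    # random_state=0 is an assumption for repeatability
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    train_acc = metrics.accuracy_score(y_train, clf.predict(X_train))
    test_acc = metrics.accuracy_score(y_test, clf.predict(X_test))
    results[depth] = (train_acc, test_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

Low accuracy on both sets at small depths points to underfitting, while training accuracy that keeps rising as test accuracy stalls or drops points to overfitting.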