# U3 Classification trees

**2023_01_26_Question 3:** $\;$ Consider the classification tree learning algorithm applied to a four-class problem, $c=1,2,3,4$. The algorithm has reached a node $t$ containing the following data: $2$ samples from class $1$, $16$ from class $2$, $8$ from class $3$, and $256$ from class $4$. The impurity of $t$, $\mathcal{I}(t)$, measured as the entropy of the class posterior probability given by the empirical distribution in $t$, is:
1. $0.00\leq \mathcal{I}(t)<0.50$
2. $0.50\leq \mathcal{I}(t)<1.00$
3. $1.00\leq \mathcal{I}(t)<1.50$
4. $1.50\leq \mathcal{I}(t)$

**Solution:** $\;$ option 2.
$$\begin{align*}
\mathcal{I}(t) &= -\sum_{c=1}^4 \hat{P}(c\mid t)\,\log_2 \hat{P}(c\mid t)\\
&= -\frac{2}{282}\log_2 \frac{2}{282} -\frac{16}{282}\log_2 \frac{16}{282} -\frac{8}{282}\log_2 \frac{8}{282}-\frac{256}{282}\log_2 \frac{256}{282}\approx 0.56
\end{align*}$$
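
The value can be double-checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `entropy_from_counts` is hypothetical and not part of the original notebook.

```python
import numpy as np

def entropy_from_counts(counts):
    """Entropy (in bits) of the empirical class distribution given per-class counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()   # empirical posterior P(c | t)
    p = p[p > 0]                # skip empty classes; 0 * log2(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

# Counts of classes 1..4 at node t, as given in the question.
impurity = entropy_from_counts([2, 16, 8, 256])
print(round(impurity, 2))  # 0.56
```

The result falls in the interval $0.50\leq \mathcal{I}(t)<1.00$, in agreement with option 2.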

**2023_01_17_Question 5:** $\;$ Consider the classification tree learning algorithm applied to a two-class problem, $c=A,B$. The algorithm has reached a node $t$ whose impurity, measured as the entropy of the class posterior probability given by the empirical distribution in $t$, is $\mathcal{I}(t)=0.72$. How many samples of each class are in node $t$?
1. 2 in class A and 32 in class B
2. 2 in class A and 16 in class B
3. 4 in class A and 32 in class B
4. 4 in class A and 16 in class B

**Solution:** $\;$ option 4.
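
One way to check this, as a sketch: the entropy depends only on the class proportions, so it is enough to evaluate it for each option (values rounded to two decimals). Options 2 and 3 share the same proportion, $1/9$, and therefore the same entropy.

$$\begin{align*}
\text{Option 1: }\; -\tfrac{2}{34}\log_2\tfrac{2}{34}-\tfrac{32}{34}\log_2\tfrac{32}{34} &\approx 0.32\\
\text{Options 2 and 3: }\; -\tfrac{1}{9}\log_2\tfrac{1}{9}-\tfrac{8}{9}\log_2\tfrac{8}{9} &\approx 0.50\\
\text{Option 4: }\; -\tfrac{4}{20}\log_2\tfrac{4}{20}-\tfrac{16}{20}\log_2\tfrac{16}{20} &\approx 0.72
\end{align*}$$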

**2023_01_17_Question 6:** 

In [1]:
import numpy as np; import matplotlib.pyplot as plt;
fig = plt.figure(figsize=(2, 2)); plt.xlim([0, 6]); plt.ylim([0, 6])
ticks = np.arange(0, 7); plt.xticks(ticks); plt.yticks(ticks); plt.grid()
X = np.array([[1,1], [1,5], [2,1], [2,3], [3,1], [3,5], [4,1], [5,1], [1,3], [2,5], [3,3], [5,3], [5,4], [5,5]])
y = np.array([1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2])
plt.scatter(*X.T, c=y, cmap=plt.cm.binary, edgecolors='black')
plt.plot((0,6),(2,2), (0,4),(4,4), (4,4),(2,6), color='black')
plt.savefig('2023_01_17_Q6.svg', format='svg'); plt.close(fig)

In [2]:
import graphviz
graphviz.Source('''digraph { rankdir=TB splines=false node[shape=oval margin=0.02 width=0 height=0]
A[label=<x2 &le; 2>] L[label=" " shape=circle] R[label=<x1 &le; 4>] A->{L, R}
RL[label=<x2 &le; 4>] RR[label=" " shape=circle style=filled color=black] R->{RL, RR}
RLL[label=" " shape=circle style=filled color=black] RLR[label=" " shape=circle] RL->{RLL, RLR}
}''').render(filename='2023_01_17_Q61', format='svg')
graphviz.Source('''digraph { rankdir=TB splines=false node[shape=oval margin=0.02 width=0 height=0]
A[label=<x2 &le; 4>] L[label=" " shape=circle] R[label=<x1 &le; 4>] A->{L, R}
RL[label=<x2 &le; 2>] RR[label=" " shape=circle style=filled color=black] R->{RL, RR}
RLL[label=" " shape=circle style=filled color=black] RLR[label=" " shape=circle] RL->{RLL, RLR}
}''').render(filename='2023_01_17_Q62', format='svg')
graphviz.Source('''digraph { rankdir=TB splines=false node[shape=oval margin=0.02 width=0 height=0]
A[label=<x2 &le; 2>] L[label=" " shape=circle] R[label=<x1 &le; 4>] A->{L, R}
RL[label=<x2 &le; 4>] RR[label=" " shape=circle] R->{RL, RR}
RLL[label=" " shape=circle style=filled color=black] RLR[label=" " shape=circle] RL->{RLL, RLR}
}''').render(filename='2023_01_17_Q63', format='svg')
graphviz.Source('''digraph { rankdir=TB splines=false node[shape=oval margin=0.02 width=0 height=0]
A[label=<x2 &le; 2>] L[label=" " shape=circle] R[label=<x1 &le; 4>] A->{L, R}
RL[label=<x2 &le; 4>] RR[label=" " shape=circle style=filled color=black] R->{RL, RR}
RLL[label=" " shape=circle] RLR[label=" " shape=circle style=filled color=black] RL->{RLL, RLR}
}''').render(filename='2023_01_17_Q64', format='svg');

The figure below shows a two-class, two-dimensional dataset along with a partition of the space into four regions, as well as four candidate classification trees. Which of the four trees is consistent with the data and the partition shown?

<div align=center>
<table><tr>
<td style="border: none; text-align:center;">

**Dataset and partition**

<img src="2023_01_17_Q6.svg" width="250"/></td>
<td style="border: none; text-align:center;">

**Tree 1**

<img src="2023_01_17_Q61.svg" width="150"/></td>
<td style="border: none; text-align:center;">

**Tree 2**

<img src="2023_01_17_Q62.svg" width="150"/></td>
<td style="border: none; text-align:center;">

**Tree 3**

<img src="2023_01_17_Q63.svg" width="150"/></td>
<td style="border: none; text-align:center;">

**Tree 4**

<img src="2023_01_17_Q64.svg" width="150"/></td>
</tr></table>
</div>


**Solution:** $\;$ Tree 1. 
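
The answer can be checked by routing each sample through Tree 1 and taking the majority class in every leaf. This is a minimal sketch under two assumptions that the question does not state explicitly: the left branch is taken when the node test holds, and a leaf's colour denotes its majority class (filled leaves correspond to class 2, since `plt.cm.binary` maps larger values to black). The names `leaf_of` and `majority_by_leaf` are hypothetical.

```python
import numpy as np

# Data copied from the plotting cell of the question.
X = np.array([[1, 1], [1, 5], [2, 1], [2, 3], [3, 1], [3, 5], [4, 1], [5, 1],
              [1, 3], [2, 5], [3, 3], [5, 3], [5, 4], [5, 5]])
y = np.array([1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2])

def leaf_of(x1, x2):
    """Route a point through Tree 1 (left branch when the test holds, by assumption)."""
    if x2 <= 2:
        return "L"
    if x1 > 4:
        return "RR"
    return "RLL" if x2 <= 4 else "RLR"

def majority_by_leaf(X, y):
    """Return the most frequent class among the samples that reach each leaf."""
    counts = {}
    for (x1, x2), c in zip(X, y):
        leaf_counts = counts.setdefault(leaf_of(x1, x2), {})
        leaf_counts[int(c)] = leaf_counts.get(int(c), 0) + 1
    return {leaf: max(cc, key=cc.get) for leaf, cc in counts.items()}

majority = majority_by_leaf(X, y)
print(majority)
```

Under these assumptions, leaves `RLL` and `RR` are dominated by class 2 and leaves `L` and `RLR` by class 1, which matches the filled and empty leaves of Tree 1. Tree 2 splits first on $x_2\leq 4$, which does not reproduce the partition, while Trees 3 and 4 have the same tests as Tree 1 but different leaf labels.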

**2022_01_27_Question 1:** $\;$ Given the following 3 nodes of a classification tree with samples belonging to 3 classes:
$$\begin{array}{c|ccc}
c & 1 & 2 & 3\\\hline
n_{1} & 2/12 & 5/12 & 5/12\\
n_{2} & 3/11 & 4/11 & 4/11\\
n_{3} & 5/11 & 3/11 & 3/11\\
\end{array}$$
where each row indicates the "posterior" probability of each class at the node. Which of the following inequalities is true?
1. $\mathcal{I}(n_{1}) < \mathcal{I}(n_{3}) < \mathcal{I}(n_{2})$
2. $\mathcal{I}(n_{3}) < \mathcal{I}(n_{2}) < \mathcal{I}(n_{1})$
3. $\mathcal{I}(n_{1}) < \mathcal{I}(n_{2}) < \mathcal{I}(n_{3})$
4. $\mathcal{I}(n_{2}) < \mathcal{I}(n_{3}) < \mathcal{I}(n_{1})$

**Solution:** $\;$ option 1.
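
A worked check, assuming impurity is measured as entropy as in the other questions of this unit (the statement does not name the measure); values are rounded to two decimals:

$$\begin{align*}
\mathcal{I}(n_1) &= -\tfrac{2}{12}\log_2\tfrac{2}{12}-2\cdot\tfrac{5}{12}\log_2\tfrac{5}{12}\approx 1.48\\
\mathcal{I}(n_3) &= -\tfrac{5}{11}\log_2\tfrac{5}{11}-2\cdot\tfrac{3}{11}\log_2\tfrac{3}{11}\approx 1.54\\
\mathcal{I}(n_2) &= -\tfrac{3}{11}\log_2\tfrac{3}{11}-2\cdot\tfrac{4}{11}\log_2\tfrac{4}{11}\approx 1.57
\end{align*}$$

Node $n_2$ is the closest to the uniform distribution and therefore has the highest entropy.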

**2022_01_13_Question 5:** $\;$ Given the following dataset used to train a classification tree with 5 two-dimensional samples belonging to 2 classes:
$$\begin{array}{c|ccccc}
n & 1 & 2 & 3 & 4 & 5\\\hline
x_{n1} & 2 & 5 & 4 & 3 & 3\\
x_{n2} & 4 & 1 & 2 & 5 & 2\\
c_n & 1 & 1 & 2 & 2 & 2\\
\end{array}$$
How many different partitions could be generated at the root node? Do not count partitions in which all the data is assigned to the same child node.
1. $7$
2. $8$
3. $5$
4. $6$

**Solution:** $\;$ option 4.
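
The count can be reproduced by enumerating axis-aligned splits of the form $x_d\leq r$. This is a minimal sketch, assuming thresholds are placed at the midpoints between consecutive distinct feature values (a common convention, not stated in the question; any threshold within the same gap gives the same split). The function name `root_splits` is hypothetical.

```python
import numpy as np

# One row per sample: (x_n1, x_n2), copied from the table in the question.
X = np.array([[2, 4], [5, 1], [4, 2], [3, 5], [3, 2]])

def root_splits(X):
    """List (feature, threshold) pairs that send at least one sample to each child."""
    splits = []
    for d in range(X.shape[1]):
        values = np.unique(X[:, d])  # sorted distinct values of feature d
        # One candidate threshold per gap between consecutive distinct values.
        for lo, hi in zip(values[:-1], values[1:]):
            splits.append((d + 1, float(lo + hi) / 2))
    return splits

splits = root_splits(X)
print(len(splits))  # 6
```

Each feature takes four distinct values, so each contributes three non-trivial thresholds, giving $3+3=6$ in total.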