# **Decision Tree**

**Decision Trees (DTs)** are a non-parametric supervised learning method used for classification and regression. (Source: https://scikit-learn.org/stable/modules/tree.html#)

## **Decision Tree for Classification**

**Question 1 A**

---

D is divided into two classes with the same number of samples.

Each class therefore has probability $p_1 = p_2 = 0.5$.

Entropy:

$$Entropy = -(0.5 \log_2(0.5)) -(0.5 \log_2(0.5)) = 1$$
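The same value can be checked numerically. The snippet below is a minimal sketch using only the standard library; the helper name `entropy` is an illustrative choice, not something defined in these notes.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Two equally likely classes, as in Question 1.
balanced = entropy([0.5, 0.5])
print(balanced)  # 1.0
```

A balanced two-class node is the worst case for entropy, which is why the result is exactly 1 bit.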



**Question 2 C**

---
D = "Raise Salary"

Probability of class 0: $p_0 = \frac{3}{5} = 0.6$

Probability of class 1: $p_1 = \frac{2}{5} = 0.4$

$Gini(D) = 1 - (p_0^2 + p_1^2) = 1 - (0.6^2 + 0.4^2) = 1 - 0.52 = 0.48$
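As a quick check of the arithmetic, the Gini impurity can be computed directly from the class counts. This is a sketch under the assumption that the node holds 3 samples of class 0 and 2 of class 1, as stated above; the function name `gini` is hypothetical.

```python
def gini(counts):
    """Gini impurity from per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Raise Salary: 3 samples of class 0, 2 samples of class 1.
raise_salary_gini = gini([3, 2])
print(round(raise_salary_gini, 2))  # 0.48
```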

**Question 3 D**

---
D = "Likes English"

**Likes English = 0**: has three samples.
*   The number of samples with **Raise Salary = 0**: 2
*   The number of samples with **Raise Salary = 1**: 1
*   Probability $p_0 = \frac{2}{3}$, $p_1 = \frac{1}{3}$

**Likes English = 1**: has two samples.
*   The number of samples with **Raise Salary = 0**: 1
*   The number of samples with **Raise Salary = 1**: 1
*   Probability $p_0 = \frac{1}{2}$, $p_1 = \frac{1}{2}$

$Gini(Likes\ English = 0) = 1 - ((\frac{2}{3})^2 + (\frac{1}{3})^2) = \frac{4}{9} \approx 0.44$

$Gini(Likes\ English = 1) = 1 - ((\frac{1}{2})^2 + (\frac{1}{2})^2) = 0.5$

$Gini(Likes\ English) = \frac{3}{5} \times \frac{4}{9} + \frac{2}{5} \times 0.5 \approx 0.47$
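The weighting step above can be sketched in code: each branch's impurity is scaled by the fraction of samples that fall into it. The helper names `gini` and `weighted_gini` are illustrative assumptions; the counts are the ones listed for this question.

```python
def gini(counts):
    """Gini impurity from per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(branches):
    """Size-weighted average of the Gini impurity over the branches of a split."""
    total = sum(sum(counts) for counts in branches)
    return sum(sum(counts) / total * gini(counts) for counts in branches)


# Each inner list is [count of Raise Salary = 0, count of Raise Salary = 1].
likes_english = [[2, 1], [1, 1]]
likes_english_gini = weighted_gini(likes_english)
print(round(likes_english_gini, 2))  # 0.47
```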

**Question 4 C**

---
D = "Age <= 26"

**Age <= 26**: consists of two samples.
*   The number of samples with **Raise Salary = 0**: 2
*   The number of samples with **Raise Salary = 1**: 0
*   Probability $p_0 = 1$, $p_1 = 0$


**Age > 26**: consists of three samples.
*   The number of samples with **Raise Salary = 0**: 1
*   The number of samples with **Raise Salary = 1**: 2
*   Probability $p_0 = \frac{1}{3}$, $p_1 = \frac{2}{3}$

Branch Age <= 26: $Gini(Age <= 26) = 1 - (1^2 + 0^2) = 0$

Branch Age > 26: $Gini(Age > 26) = 1 - ((\frac{1}{3})^2 + (\frac{2}{3})^2) = \frac{4}{9} \approx 0.444$

Weighted Gini of the split: $Gini(Age) = \frac{2}{5} \times 0 + \frac{3}{5} \times \frac{4}{9} \approx 0.27$
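A tree builder would compare this score against the other candidate splits and keep the one with the lowest weighted impurity. The sketch below does that for the two splits worked out in these notes (Likes English and Age <= 26); the helper names and the dictionary layout are assumptions made for illustration.

```python
def gini(counts):
    """Gini impurity from per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(branches):
    """Size-weighted average of the Gini impurity over the branches of a split."""
    total = sum(sum(counts) for counts in branches)
    return sum(sum(counts) / total * gini(counts) for counts in branches)


# Each inner list is [count of Raise Salary = 0, count of Raise Salary = 1].
candidate_splits = {
    "Likes English": [[2, 1], [1, 1]],
    "Age <= 26": [[2, 0], [1, 2]],
}

scores = {name: weighted_gini(b) for name, b in candidate_splits.items()}
best_split = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("Lowest weighted Gini:", best_split)
```

On these counts the Age <= 26 split scores lower because one of its branches is pure.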



**Question 5 B**

---
D = "Raise Salary"

The number of samples with **Raise Salary = 0**: 3

The number of samples with **Raise Salary = 1**: 2

Probability $p_0 = \frac{3}{5}$, $p_1 = \frac{2}{5}$

$Entropy(Raise\ Salary) = -(\frac{3}{5} \log_2 \frac{3}{5}) -(\frac{2}{5} \log_2 \frac{2}{5}) \approx 0.971$


**Question 6 A**

---
D = "Likes English"

**Likes English = 0**: consists of three samples.
*   The number of samples with **Raise Salary = 0**: 2
*   The number of samples with **Raise Salary = 1**: 1
*   Probability $p_0 = \frac{2}{3}$, $p_1 = \frac{1}{3}$
*   $Entropy(D_1) = -(\frac{2}{3}\log_2\frac{2}{3}) - (\frac{1}{3}\log_2\frac{1}{3}) \approx 0.918$

**Likes English = 1**: consists of two samples.
*   The number of samples with **Raise Salary = 0**: 1
*   The number of samples with **Raise Salary = 1**: 1
*   Probability $p_0 = \frac{1}{2}$, $p_1 = \frac{1}{2}$
*   $Entropy(D_2) = -(\frac{1}{2}\log_2\frac{1}{2}) - (\frac{1}{2}\log_2\frac{1}{2}) = 1$

$Entropy(Likes\ English) = \frac{3}{5} \times 0.918 + \frac{2}{5} \times 1 \approx 0.9508$

$Gain(Likes\ English) = Entropy(Raise\ Salary) - Entropy(Likes\ English) \approx 0.971 - 0.951 = 0.020$
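Information gain is usually defined as the entropy of the parent node minus the size-weighted entropy of its branches. The sketch below applies that definition, taking the Raise Salary counts from Question 5 (3 vs. 2) as the parent; the function names are hypothetical.

```python
import math


def entropy(counts):
    """Shannon entropy in bits from per-class sample counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, branches):
    """Parent entropy minus the size-weighted entropy of the branches."""
    total = sum(parent)
    remainder = sum(sum(b) / total * entropy(b) for b in branches)
    return entropy(parent) - remainder


# Parent: Raise Salary counts [class 0, class 1]; branches: Likes English = 0 and = 1.
gain = information_gain([3, 2], [[2, 1], [1, 1]])
print(round(gain, 3))  # 0.02
```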


**Question 7 A**

**Question 8 A**

In [1]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Paragraph C :
# Load the Iris dataset
iris_X, iris_y = datasets.load_iris(return_X_y=True)
# Split train : test = 8:2
X_train, X_test, y_train, y_test = train_test_split(
    iris_X,
    iris_y,
    test_size=0.2,
    random_state=42,
)

# Paragraph B :
# Define model
dt_classifier = DecisionTreeClassifier()

# Paragraph A :
# Train
dt_classifier.fit(X_train, y_train)

# Paragraph D :
# Predict and evaluate
y_pred = dt_classifier.predict(X_test)
accuracy_score(y_test, y_pred)

1.0

## **Decision Tree for Regression**

**Question 9 A**

---
D = "Likes AI"

Likes AI = 0: consists of three samples.
*   Salary is 200, 300, and 400
*   $Mean_0 = \frac{200 + 300 + 400}{3} = 300$
*   $SSE(D_0) = \frac{(200 - 300)^2 + (300 - 300)^2 + (400 - 300)^2}{3} \approx 6667$

Likes AI = 1: consists of two samples.
*   Salary is 400 and 500
*   $Mean_1 = \frac{400 + 500}{2} = 450$
*   $SSE(D_1) = \frac{(400 - 450)^2 + (500 - 450)^2}{2} = 2500$

$SSE(D) = SSE(D_0) + SSE(D_1) \approx 9167$
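The per-branch computation can be reproduced with a few lines of Python. Note that the quantity these notes call SSE divides by the number of samples in the branch, so it behaves like a per-branch mean squared error; the helper name below reflects that and is an illustrative choice. The second branch is assumed to hold salaries 400 and 500, which is what the mean of 450 implies.

```python
def mean_squared_deviation(values):
    """Average squared distance of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


likes_ai_0 = [200, 300, 400]  # salaries where Likes AI = 0
likes_ai_1 = [400, 500]       # salaries where Likes AI = 1 (assumed from the mean of 450)

split_score = mean_squared_deviation(likes_ai_0) + mean_squared_deviation(likes_ai_1)
print(round(split_score))  # 9167
```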

**Question 10 C**

---
$D_1$: Age <= 24
*   $Mean_{D_1} = \frac{200}{1} = 200$
*   $SSE_{D_1} = \frac{(200 - 200)^2}{1} = 0$

$D_2$: Age > 24
*   $Mean_{D_2} = \frac{400 + 300 + 500 + 400}{4} = 400$
*   $SSE_{D_2} = \frac{(400 - 400)^2 + (300 - 400)^2 + (500 - 400)^2 + (400 - 400)^2}{4} = 5000$

$SSE(Age <= 24) = SSE_{D_1} + SSE_{D_2} = 5000$
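Because each branch score here is a divide-by-n squared deviation, it matches the population variance, so the standard library's `statistics.pvariance` can serve as a cross-check. The sketch below scores the Age <= 24 split alongside the Likes AI split from Question 9 (salaries taken from these notes, with 400 and 500 assumed for Likes AI = 1); the dictionary layout is an assumption for illustration.

```python
from statistics import pvariance

# Salaries grouped by branch for the two candidate splits worked out in these notes.
candidate_splits = {
    "Likes AI": [[200, 300, 400], [400, 500]],
    "Age <= 24": [[200], [400, 300, 500, 400]],
}

# Score of a split = sum of the population variances of its branches.
scores = {
    name: sum(pvariance(branch) for branch in branches)
    for name, branches in candidate_splits.items()
}
best_split = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.0f}")
print("Lowest score:", best_split)
```

Under this scoring, the Age <= 24 split comes out lower than the Likes AI split.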

**Question 11 A**

In [2]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Paragraph C :
# Load dataset
machine_cpu = fetch_openml(name="machine_cpu")
machine_data = machine_cpu.data
machine_labels = machine_cpu.target
# Split train : test = 8:2
X_train, X_test, y_train, y_test = train_test_split(
    machine_data,
    machine_labels,
    test_size=0.2,
    random_state=42
)

# Paragraph B :
# Define model
tree_reg = DecisionTreeRegressor()

# Paragraph A :
# Train
tree_reg.fit(X_train, y_train)

# Paragraph D :
# Predict and evaluate
y_pred = tree_reg.predict(X_test)
mean_squared_error(y_test, y_pred)


9180.199074074075