## Decision Trees

1. Train and fine-tune a Decision Tree for the moons dataset.

(a) Generate a moons dataset using make_moons(n_samples=10000, noise=0.4).

In [18]:
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=10000, noise=0.4, random_state=0)

(b) Split it into a training set and a test set using train_test_split().

In [19]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

(c) Use grid search with cross-validation (with the help of the GridSearchCV class) to find good hyper-parameter values
for a DecisionTreeClassifier. Hint: try various values for max_leaf_nodes.

In [20]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

tree_parameters = {'criterion':['gini','entropy'], 'max_leaf_nodes': list(range(2, 50))}
clf = GridSearchCV(DecisionTreeClassifier(random_state=0), tree_parameters, cv=5)
clf.fit(X_train, y_train)
print(clf.best_estimator_)

DecisionTreeClassifier(max_leaf_nodes=36, random_state=0)


Following the hint, I tuned the maximum number of leaf nodes (together with the split criterion).
Other hyper-parameters, such as max_depth or min_samples_leaf, could also be included in the grid.
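As a rough illustration of taking other parameters into account, the sketch below searches a wider grid. The value ranges in `wider_grid` are hypothetical choices for illustration, not tuned recommendations, and the dataset is smaller than the one in the exercise so that the search runs quickly.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Smaller dataset than in the exercise, only to keep the sketch fast.
X, y = make_moons(n_samples=2000, noise=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hypothetical wider grid: the ranges are illustrative assumptions.
wider_grid = {
    "max_leaf_nodes": [4, 8, 16, 32],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 20],
}
wider_search = GridSearchCV(DecisionTreeClassifier(random_state=0), wider_grid, cv=5)
wider_search.fit(X_train, y_train)

print(wider_search.best_params_)
print(f"Best cross-validated accuracy: {wider_search.best_score_:.3f}")
```

The grid grows multiplicatively with every added parameter (here 4 x 3 x 3 = 36 candidates, each fitted 5 times), which is one reason to start with a single parameter as the hint suggests.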

(d) Train it on the full training set using these hyper-parameters, and measure your model’s performance
on the test set.

In [21]:
from sklearn.metrics import accuracy_score

best_tree = DecisionTreeClassifier(max_leaf_nodes=36, random_state=0)
best_tree.fit(X_train, y_train)
y_predicted = best_tree.predict(X_test)
print(f"The model's accuracy score is: {accuracy_score(y_test, y_predicted)}")

The model's accuracy score is: 0.868
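Retraining by hand is not strictly necessary: GridSearchCV uses refit=True by default, so after fit() the best estimator has already been retrained on the whole training set. The sketch below compares both routes on a smaller dataset and a reduced grid; the names `search`, `refit_accuracy` and `manual_accuracy` are introduced here for illustration only.

```python
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Smaller dataset and grid than in the exercise, only to keep the sketch fast.
X, y = make_moons(n_samples=2000, noise=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_leaf_nodes": [4, 8, 16, 32]},
    cv=5,
)
search.fit(X_train, y_train)

# Route 1: refit=True (the default) means search.predict() uses a model
# that was already retrained on all of X_train with the best parameters.
refit_accuracy = accuracy_score(y_test, search.predict(X_test))

# Route 2: rebuild the tree by hand from the best parameters.
manual_tree = DecisionTreeClassifier(random_state=0, **search.best_params_)
manual_tree.fit(X_train, y_train)
manual_accuracy = accuracy_score(y_test, manual_tree.predict(X_test))

print(refit_accuracy, manual_accuracy)
```

Passing `**search.best_params_` also avoids copying the selected values into the code by hand.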


2. Grow a forest.

(a) Continuing the previous exercise, generate 1,000 subsets of the training set,
each containing 100 instances selected randomly.
Hint: you can use Scikit-Learn’s ShuffleSplit class for this.

In [22]:
from sklearn.model_selection import ShuffleSplit

# An integer train_size asks for exactly 100 instances per subset
random_shuffle = ShuffleSplit(n_splits=1000, train_size=100, random_state=0)

X_subsets, y_subsets = {}, {}
for i, (subset_idx, _) in enumerate(random_shuffle.split(X_train)):
    X_subsets[i] = X_train[subset_idx]
    y_subsets[i] = y_train[subset_idx]

(b) Train one Decision Tree on each subset, using the best hyper-parameter values
found above. Evaluate these 1,000 Decision Trees on the test set.

In [23]:
import numpy as np

best_tree = DecisionTreeClassifier(max_leaf_nodes=36, random_state=0)
scores = []
for subset in range(len(X_subsets)):
    best_tree.fit(X_subsets[subset], y_subsets[subset])
    y_predict = best_tree.predict(X_test)
    scores.append(accuracy_score(y_test, y_predict))

print(np.mean(scores))

0.7941336


(c) For each test set instance, generate the predictions of the 1,000 Decision
Trees, and keep only the most frequent prediction (you can use SciPy’s mode()
function for this). This gives you majority-vote predictions over the test set.

In [24]:
from scipy.stats import mode

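# One row of test-set predictions per tree
predictions = []
for i in range(len(X_subsets)):
    predictions.append(best_tree.fit(X_subsets[i], y_subsets[i]).predict(X_test))

# axis=0 takes the vote across trees, giving one result per test instance
y_predict_vote, vote_count = mode(np.array(predictions), axis=0)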
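To make the voting step concrete, here is a toy sketch with three hypothetical trees and four test instances (the array `toy_predictions` is made up for illustration). It also shows a shortcut that only applies to 0/1 labels: a class wins when more than half of the trees predict it.

```python
import numpy as np
from scipy.stats import mode

# Toy predictions: 3 hypothetical trees (rows) over 4 test instances (columns).
toy_predictions = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
])

# axis=0 votes across trees, giving one label per test instance.
vote_result = mode(toy_predictions, axis=0)
# ravel() flattens the result whether or not SciPy keeps a leading axis.
majority_mode = np.ravel(vote_result.mode)

# Shortcut for binary 0/1 labels: class 1 wins if more than half the trees chose it.
majority_mean = (toy_predictions.mean(axis=0) > 0.5).astype(int)

print(majority_mode)  # [0 1 0 0]
print(majority_mean)  # [0 1 0 0]
```

With an even number of trees a tie is possible; mode() then returns the smaller label, and the strict `> 0.5` comparison makes the shortcut agree with it.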

(d) Evaluate these predictions on the test set. How does the performance compare
with the single model trained in exercise 1?

In [25]:
from sklearn.metrics import confusion_matrix
print(f"The model's accuracy score is: {accuracy_score(y_test, y_predict_vote.T)}\n")
print("Confusion Matrix standard: \n", confusion_matrix(y_test, y_predicted),"\n")
print("Confusion Matrix voting: \n", confusion_matrix(y_test, y_predict_vote.T),"\n")

The model's accuracy score is: 0.8692

Confusion Matrix standard: 
 [[1107  158]
 [ 172 1063]] 

Confusion Matrix voting: 
 [[1103  162]
 [ 165 1070]] 



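Although the improvement is small (0.8692 versus 0.868 accuracy), majority voting gives a better result than the single tree, as expected from the theory covered in class. It is also well above the mean accuracy of the individual small trees (about 0.794).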
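The theoretical expectation can be sketched with a binomial model. This is an idealised calculation that assumes the voters make independent errors, which the trees here certainly do not (they are trained on overlapping samples of the same noisy data), so it gives an upper bound on the intuition rather than a prediction of the 0.8692 observed above. The helper name `majority_vote_accuracy` is introduced for illustration.

```python
from scipy.stats import binom


def majority_vote_accuracy(n_voters: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes every voter is right with the same probability p_correct and
    that their errors are independent; ties count as wrong.
    """
    # binom.sf(k, n, p) is P(X > k), so this is P(X > n / 2).
    return float(binom.sf(n_voters // 2, n_voters, p_correct))


for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Under these assumptions the ensemble accuracy rises with the number of voters whenever each voter is better than chance. In practice the gain levels off because correlated trees tend to make the same mistakes.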

Bonus: a random forest, just out of curiosity.

In [26]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (must be imported before HalvingGridSearchCV)
from sklearn.model_selection import HalvingGridSearchCV

param_grid = {'max_leaf_nodes': [30, 40, 50, 60, 70, 80]}
base_estimator = RandomForestClassifier(random_state=0)
sh = HalvingGridSearchCV(base_estimator, param_grid, cv=5, factor=2,
                         resource='n_estimators', max_resources=30).fit(X_train, y_train)
best_random_forest = sh.best_estimator_
print(best_random_forest)

RandomForestClassifier(max_leaf_nodes=60, n_estimators=28, random_state=0)
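To see where a value such as n_estimators=28 can come from, the sketch below reruns the same kind of search on a smaller dataset and prints the halving schedule. As far as I understand the default min_resources='exhaust' rule, six candidates with factor=2 and max_resources=30 should be evaluated with 7, 14 and 28 trees in turn, with the weaker half of the candidates dropped after each round; the variable name `search` is introduced for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV, train_test_split

# Smaller dataset than in the exercise, only to keep the sketch fast.
X, y = make_moons(n_samples=1000, noise=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {"max_leaf_nodes": [30, 40, 50, 60, 70, 80]}
search = HalvingGridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    factor=2,
    resource="n_estimators",  # the budget that grows each round is the number of trees
    max_resources=30,
)
search.fit(X_train, y_train)

print("iterations:", search.n_iterations_)
print("candidates per iteration:", search.n_candidates_)
print("trees per candidate:", search.n_resources_)
```

Because the resource is n_estimators, the winning forest keeps the tree count of the last round rather than the full max_resources budget.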


In [27]:
y_predict_best_random_forest = best_random_forest.predict(X_test)
print(f"The model's accuracy score is: {accuracy_score(y_test, y_predict_best_random_forest)}\n")

The model's accuracy score is: 0.8724



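In [30]:
from sklearn.ensemble import ExtraTreesClassifier
param_grid = {'max_leaf_nodes': [30, 40, 50, 60, 70, 80]}
base_estimator = ExtraTreesClassifier(random_state=0)
sh = HalvingGridSearchCV(base_estimator, param_grid, cv=5, factor=2,
                         resource='n_estimators', max_resources=30).fit(X_train, y_train)
# Predict with the extra-trees model found by this search, not with best_random_forest
best_extra_trees = sh.best_estimator_
y_predict_extra = best_extra_trees.predict(X_test)
print(f"The model's accuracy score is: {accuracy_score(y_test, y_predict_extra)}\n")

The accuracy of 0.8724 first recorded for this cell was computed with best_random_forest, so it only repeated the random forest score. The cell has to be rerun to obtain the extra-trees accuracy.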

