
1. Train and fine-tune a Decision Tree on the moons dataset with 10,000 instances.

In [1]:
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)

In [2]:
# Let's see the shape of features and labels
print("X shape: ", X.shape)
print("Y shape: ", y.shape)

X shape:  (10000, 2)
Y shape:  (10000,)


In [3]:
# Split the dataset into a training set and a test set (80% / 20%)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
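
The split above does not fix a seed, so the exact accuracy figures will vary from run to run. A minimal sketch of a reproducible variant follows; `random_state=42` and `stratify=y` are assumptions added for illustration, not the settings that produced the outputs recorded in this notebook.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)

# random_state=42 is an assumed seed (not the one behind the recorded outputs).
# stratify=y keeps the two classes balanced in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train shape:", X_train.shape)  # (8000, 2)
print("Test shape: ", X_test.shape)   # (2000, 2)
```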

In [21]:
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier()

In [22]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [2, 3, 5],
    "max_leaf_nodes": [5, 10, 20]
}

grid_search = GridSearchCV(tree_clf, param_grid, cv=5, scoring="accuracy")

In [23]:
# First, run the grid search on only the first 1,000 training instances to see how it performs

grid_search.fit(X_train[:1000], y_train[:1000])

In [24]:
print("Best parameter: ", grid_search.best_params_)
print("Best estimator: ", grid_search.best_estimator_)

Best parameter:  {'max_depth': 2, 'max_leaf_nodes': 5}
Best estimator:  DecisionTreeClassifier(max_depth=2, max_leaf_nodes=5)


In [25]:
y_pred = grid_search.best_estimator_.predict(X_test)

In [26]:
from sklearn.metrics import accuracy_score

print("Accuracy: ", accuracy_score(y_test, y_pred))

Accuracy:  0.85


The model reaches 85% accuracy on the test set. Let's run the grid search on the complete training set this time.

In [27]:
grid_search.fit(X_train, y_train)

In [28]:
print("Best parameter: ", grid_search.best_params_)
print("Best estimator: ", grid_search.best_estimator_)

Best parameter:  {'max_depth': 2, 'max_leaf_nodes': 5}
Best estimator:  DecisionTreeClassifier(max_depth=2, max_leaf_nodes=5)


In [30]:
y_pred = grid_search.best_estimator_.predict(X_test)
print("Accuracy score: ", round(accuracy_score(y_test, y_pred), 2))

Accuracy score:  0.86


The selected hyperparameters are unchanged, and test accuracy rises slightly, from 85% to 86%.
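
One thing worth checking: a binary tree of depth 2 has at most 4 leaves, so with `max_depth=2` the `max_leaf_nodes` values in the grid (5, 10, 20) cannot change the tree. The sketch below confirms the leaf count and then tries a wider search over `max_leaf_nodes` alone. The grid range, `cv=3`, and the `random_state` values are assumptions chosen for illustration, and the names `shallow`, `wide_grid`, and `wide_search` are hypothetical.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
# Assumed seed, so the numbers may differ from the outputs recorded above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A depth-2 binary tree can have at most 2**2 = 4 leaves,
# so max_leaf_nodes=5 is never the binding constraint here.
shallow = DecisionTreeClassifier(max_depth=2, max_leaf_nodes=5, random_state=42)
shallow.fit(X_train, y_train)
n_leaves = shallow.get_n_leaves()
print("Leaves at max_depth=2:", n_leaves)

# Hypothetical wider grid: leave max_depth unset and search max_leaf_nodes only
wide_grid = {"max_leaf_nodes": list(range(2, 31))}
wide_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), wide_grid, cv=3, scoring="accuracy"
)
wide_search.fit(X_train, y_train)

print("Best parameter:", wide_search.best_params_)
print("Test accuracy: ", round(wide_search.score(X_test, y_test), 2))
```

If the best value lands at the edge of the range, that is usually a hint to extend the grid further in that direction.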

2. Grow a forest.

a. Continuing the previous exercise, generate 1,000 subsets of the training set, each containing 100 instances selected randomly.

In [31]:
from sklearn.model_selection import ShuffleSplit

n_trees = 1000
n_instances = 100

mini_sets = []

rs = ShuffleSplit(n_splits = n_trees, test_size = len(X_train) - n_instances,
                  random_state = 42)

In [35]:
for mini_train_index, mini_test_index in rs.split(X_train):
  X_mini_train = X_train[mini_train_index]
  y_mini_train = y_train[mini_train_index]
  mini_sets.append((X_mini_train, y_mini_train))
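
`ShuffleSplit` is used here only for the "train" side of each split, which is a random sample of 100 indices drawn without replacement. The same idea can be sketched with NumPy directly. This is an alternative for illustration, not the method used in this notebook; the `make_moons` call is a stand-in for `X_train` and `y_train`, and the seed is an assumption.

```python
import numpy as np
from sklearn.datasets import make_moons

# Stand-in for the 8,000-instance training set used above
X_train, y_train = make_moons(n_samples=8000, noise=0.4, random_state=42)

n_trees = 1000
n_instances = 100
rng = np.random.default_rng(42)  # assumed seed

mini_sets = []
for _ in range(n_trees):
    # Draw 100 distinct row indices for this subset
    idx = rng.choice(len(X_train), size=n_instances, replace=False)
    mini_sets.append((X_train[idx], y_train[idx]))

print("Number of subsets:", len(mini_sets))
print("Subset shapes:    ", mini_sets[0][0].shape, mini_sets[0][1].shape)
```

Instances are unique within a subset, but the same instance can appear in several subsets, as with `ShuffleSplit`.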

b. Train one Decision Tree on each subset using the best hyperparameter values found above. Evaluate these 1,000 Decision Trees on the test set.

In [37]:
from sklearn.base import clone

forest = [clone(grid_search.best_estimator_) for _ in range(n_trees)]

accuracy_scores = []

for tree, (X_mini_train, y_mini_train) in zip(forest, mini_sets):
  tree.fit(X_mini_train, y_mini_train)

  y_pred = tree.predict(X_test)
  accuracy_scores.append(accuracy_score(y_test, y_pred))

In [38]:
import numpy as np

print("Average accuracy score: ", round(np.mean(accuracy_scores), 2))

Average accuracy score:  0.82


c. Now comes the magic. For each test set instance, generate the predictions of the 1,000 Decision Trees, and keep only the most frequent prediction.

In [39]:
Y_pred = np.empty([n_trees, len(X_test)], dtype = np.uint8)

for tree_index, tree in enumerate(forest):
  Y_pred[tree_index] = tree.predict(X_test)

In [40]:
from scipy.stats import mode

y_pred_majority_votes, n_votes = mode(Y_pred, axis = 0)
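
To make the voting step concrete, here is a toy example with 3 trees and 4 test instances, laid out like `Y_pred` (one row per tree, one column per instance). The values are made up for illustration. Since the labels are 0 and 1, the majority vote can also be sketched as "more than half of the trees predicted 1", without `scipy`.

```python
import numpy as np

# Toy predictions (made-up values): rows are trees, columns are test instances
Y_pred_toy = np.array(
    [
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [1, 1, 0, 0],
    ],
    dtype=np.uint8,
)

# Fraction of trees that voted for class 1, per test instance
vote_share = Y_pred_toy.mean(axis=0)

# Majority vote for binary labels; an exact tie would go to class 0 in this sketch
majority = (vote_share > 0.5).astype(np.uint8)

print(majority)  # [0 1 0 0]
```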

d. Evaluate these predictions on the test set:

In [41]:
accuracy_score(y_test, y_pred_majority_votes.reshape([-1]))

0.8615

The majority vote reaches about 86% accuracy on the test set. That is well above the 82% average of the individual trees, each trained on only 100 instances, and on par with the single tree trained on the full training set.