Train and fine-tune a Decision Tree for the moons dataset.

a. Generate a moons dataset using make_moons(n_samples=10000, noise=0.4).

In [2]:
from sklearn.datasets import make_moons

# Step 1: Generate moons dataset
X, y = make_moons(n_samples=10000, noise=0.4)

b. Split it into a training set and a test set using train_test_split().

In [3]:
from sklearn.model_selection import train_test_split

# Step 2: Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

c. Use grid search with cross-validation (with the help of the GridSearchCV class) to find good hyperparameter values for a DecisionTreeClassifier. 

Hint: try various values for max_leaf_nodes.

In [4]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Step 3: Grid search for best hyperparameters
param_grid = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [2, 3, 4]}
tree_clf = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(tree_clf, param_grid, n_jobs=-1, verbose=1, cv=3)

grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 294 candidates, totalling 882 fits


In [5]:
grid_search.best_estimator_
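
Beyond `best_estimator_`, the fitted search object also exposes the winning hyperparameters and their mean cross-validation score. The following is a minimal sketch on a smaller, seeded dataset so that it runs quickly; the `demo_*` names, the reduced grid, and the `random_state` values are assumptions made for illustration, not part of the exercise.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Smaller, seeded dataset so the sketch runs quickly (illustrative assumption)
X_demo, y_demo = make_moons(n_samples=1000, noise=0.4, random_state=42)
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Reduced grid: only a handful of max_leaf_nodes values
demo_grid = {"max_leaf_nodes": [2, 4, 8, 16, 32]}
demo_search = GridSearchCV(DecisionTreeClassifier(random_state=42), demo_grid, cv=3)
demo_search.fit(X_demo_train, y_demo_train)

# Best hyperparameter combination and its mean cross-validated accuracy
print(demo_search.best_params_)
print(demo_search.best_score_)
```

Note that `best_score_` is a cross-validation estimate computed on the training folds, so it can differ from the test-set accuracy measured later.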

d. Train it on the full training set using these hyperparameters, and measure your model's performance on the test set. You should get roughly 85% to 87% accuracy.

By default, GridSearchCV trains the best model found on the whole training set (you can change this by setting refit=False), so we don't need to do it again. We can simply evaluate the model's accuracy:

In [7]:
from sklearn.metrics import accuracy_score

y_pred = grid_search.predict(X_test)
accuracy_score(y_test, y_pred)

0.879
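
The refit behavior described above can be checked directly. This sketch assumes a small seeded dataset and a reduced grid (`X_small`, `search`, and `manual_tree` are hypothetical names); it compares the search object's predictions with those of `best_estimator_` and with a tree trained by hand using `best_params_`.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_small, y_small = make_moons(n_samples=500, noise=0.4, random_state=0)

# refit=True is the default, so the best model is retrained on all of X_small
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"max_leaf_nodes": [4, 8, 16]}, cv=3)
search.fit(X_small, y_small)

# predict() on the search object delegates to the refit best estimator
delegates = np.array_equal(search.predict(X_small),
                           search.best_estimator_.predict(X_small))

# A fresh tree with the same hyperparameters and seed, fit on the same data
manual_tree = DecisionTreeClassifier(random_state=0, **search.best_params_)
manual_tree.fit(X_small, y_small)
same_as_manual = np.array_equal(manual_tree.predict(X_small),
                                search.best_estimator_.predict(X_small))

print(delegates, same_as_manual)
```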

In [8]:
# best_estimator_ was already refit on the full training set (refit=True is
# the default), so there is no need to call fit() again.
best_tree_clf = grid_search.best_estimator_

# Step 4: Evaluate performance on test set
accuracy = best_tree_clf.score(X_test, y_test)
print("Accuracy on test set:", accuracy)

Accuracy on test set: 0.879


# Grow a forest

To generate 1,000 subsets of the training set, we can use the ShuffleSplit class from Scikit-Learn. This class generates a user-defined number of independent random train/test splits. We keep only the train indices of each split, so the train side must be the one that contains 100 instances. Here is how we can generate 1,000 subsets of the training set with 100 instances each:

In [6]:
from sklearn.model_selection import ShuffleSplit

n_trees = 1000
n_instances = 100

subsets = []
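# The loop below keeps the train indices, so the test side must take everything
# except n_instances; otherwise each subset would hold len(X_train) - 100 instances.
rs = ShuffleSplit(n_splits=n_trees, test_size=len(X_train) - n_instances, random_state=42)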

for train_index, _ in rs.split(X_train):
    X_subset = X_train[train_index]
    y_subset = y_train[train_index]
    subsets.append((X_subset, y_subset))
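
Because ShuffleSplit yields `(train_index, test_index)` pairs and `test_size` controls the second element, it is easy to end up with the sizes swapped. The sketch below checks the sizes on a toy array; `X_toy` and `splitter` are hypothetical names, and the 500-instance array is an assumption chosen only to keep the numbers small.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# 500 hypothetical instances with 2 features each
X_toy = np.arange(1000).reshape(500, 2)

# Ask for a test side of 400 so that the train side holds the remaining 100
splitter = ShuffleSplit(n_splits=5, test_size=len(X_toy) - 100, random_state=42)
sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in splitter.split(X_toy)]

print(sizes)  # each pair is (100, 400)
```

Passing `train_size=100` instead should give the same sizes, since `test_size` then defaults to the complement of the train size.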

Now we can train one Decision Tree on each subset using the best hyperparameters found in the previous step. Cloning best_estimator_ gives a fresh, unfitted tree with the same hyperparameters. Here is how we can train the 1,000 Decision Trees:

In [39]:
from sklearn.base import clone

forest = []
for subset in subsets:
    tree = clone(grid_search.best_estimator_)
    tree.fit(subset[0], subset[1])
    forest.append(tree)
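
Before combining the trees, it can be useful to measure how each one performs on its own, which gives a baseline for the majority vote. The following is a scaled-down sketch under stated assumptions: 50 trees instead of 1,000, a 2,000-instance seeded dataset, and a hypothetical `base_tree` with `max_leaf_nodes=16` standing in for `grid_search.best_estimator_`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_m, y_m = make_moons(n_samples=2000, noise=0.4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_m, y_m, test_size=0.2, random_state=42)

# Hypothetical stand-in for grid_search.best_estimator_
base_tree = DecisionTreeClassifier(max_leaf_nodes=16, random_state=42)

# 50 random subsets of 100 training instances each
splitter = ShuffleSplit(n_splits=50, test_size=len(X_tr) - 100, random_state=42)

scores = []
for subset_idx, _ in splitter.split(X_tr):
    small_tree = clone(base_tree).fit(X_tr[subset_idx], y_tr[subset_idx])
    scores.append(accuracy_score(y_te, small_tree.predict(X_te)))

# Average test accuracy of the individual small trees
mean_small_tree_accuracy = np.mean(scores)
print(mean_small_tree_accuracy)
```

Trees trained on only 100 instances each would typically be expected to score lower than a single tree trained on the full training set.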

Next, we generate the predictions of the 1,000 Decision Trees for each test instance and keep the most frequent prediction. We can use SciPy's mode() function to find the most frequent prediction:

In [40]:
import numpy as np
from scipy.stats import mode

y_pred = np.empty([n_trees, len(X_test)])
for tree_idx, tree in enumerate(forest):
    y_pred[tree_idx] = tree.predict(X_test)

y_pred_majority_votes, n_votes = mode(y_pred, axis=0)
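
To see what the `mode()` call computes, here is a toy example with three hypothetical trees voting on four instances. The `toy_preds` array is made up for illustration. `np.ravel` is applied because, as far as I know, the shape of the arrays returned by `mode()` depends on the SciPy version.

```python
import numpy as np
from scipy.stats import mode

# Rows are trees, columns are test instances (made-up predictions)
toy_preds = np.array([
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
])

# Most frequent label per column, and how many trees voted for it
votes, counts = mode(toy_preds, axis=0)
votes = np.ravel(votes)    # flatten regardless of SciPy's output shape
counts = np.ravel(counts)

print(votes)   # [0 1 1 0]
print(counts)  # [2 3 2 2]
```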

Finally, we evaluate the majority-vote predictions on the test set and compare their accuracy with that of the single Decision Tree evaluated earlier (0.879). The majority-vote predictions are expected to be slightly more accurate than the single tree. Here is how we can compute the accuracy:

In [41]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred_majority_votes.reshape([-1]))
accuracy

0.7