Grow a forest by following these steps.
- Continuing the previous exercise, generate 1,000 subsets of the training set, each containing 100 instances selected randomly. You can use Scikit-Learn's `ShuffleSplit` class for this.
- Train one decision tree on each subset, using the best hyperparameter values found in the previous exercise. Evaluate these 1,000 decision trees on the test set. Since they were trained on smaller sets, these decision trees will likely perform worse than the first decision tree, achieving only about 80% accuracy.
- Now comes the magic. For each test set instance, generate the predictions of the 1,000 decision trees and keep only the most frequent prediction. You can use SciPy's `mode()` function for this. This approach gives you majority-vote predictions over the test set.
- Evaluate these predictions on the test set. You should obtain a slightly higher accuracy than your first model, about 0.5 to 1.5% higher. Congratulations, you have trained a random forest classifier.

https://github.com/ageron/handson-ml2/blob/master/06_decision_trees.ipynb
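
The majority-vote step is easiest to see on a tiny example. The sketch below uses a made-up prediction matrix (5 hypothetical trees, 4 hypothetical test instances) and is only an illustration of how `scipy.stats.mode` can pick the most frequent label per column. `np.ravel` is used because the shape of the returned mode differs between SciPy versions.

```python
import numpy as np
from scipy.stats import mode

# Hypothetical predictions: one row per tree, one column per test instance.
Y_pred = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
])

# Most frequent label in each column, i.e. the majority vote per instance.
majority = np.ravel(mode(Y_pred, axis=0).mode)
print(majority)
```

Each column is voted on independently, so a single tree that is wrong on one instance is outvoted as long as most of the other trees are right on that instance.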

In [59]:
%pip install scikit-learn pandas numpy

Note: you may need to restart the kernel to use updated packages.




In [60]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ShuffleSplit

In [61]:
X,y = make_moons(n_samples=10000, noise=0.1, random_state=42)

In [62]:
X, y

(array([[ 0.45549318, -0.12550304],
        [-0.70421731,  0.04130827],
        [ 0.41379864,  0.79132194],
        ...,
        [-0.03803096,  0.24540791],
        [ 0.8698024 ,  0.52329471],
        [ 1.16429638, -0.3798397 ]]),
 array([1, 0, 0, ..., 1, 0, 1]))

In [63]:
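# Draw 1,000 random splits; the 100-instance "test" side of each split serves as one small subset.
# Note: the splits are drawn from the full data set, since no separate test set was held out above.
train = ShuffleSplit(n_splits=1000, test_size=100, random_state=42)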

In [64]:
subsets = []
for train_index, test_index in train.split(X):
    # Keep only the 100 instances on the test side of each split
    X_subset = X[test_index]
    y_subset = y[test_index]
    subsets.append((X_subset, y_subset))

In [65]:
print("First subset shape (X):", subsets[0][0].shape)  # Should be (100, 2)
print("First subset shape (y):", subsets[0][1].shape)  # Should be (100,)

First subset shape (X): (100, 2)
First subset shape (y): (100,)
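
The exercise asks for subsets of the training set, so one option is to hold out a test set first and draw the subsets only from the training part. The following is a minimal sketch under that reading. The 80/20 split and the names `X_train`, `rs`, and `mini_sets` are assumptions made for illustration and are not part of the cells above.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import ShuffleSplit, train_test_split

X, y = make_moons(n_samples=10000, noise=0.1, random_state=42)

# Assumption: hold out 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

n_trees, n_instances = 1000, 100
# Sizing the test side as "everything else" leaves exactly
# n_instances rows on the train side of each split.
rs = ShuffleSplit(
    n_splits=n_trees,
    test_size=len(X_train) - n_instances,
    random_state=42,
)

# Each mini training set is drawn from X_train only.
mini_sets = [(X_train[idx], y_train[idx]) for idx, _ in rs.split(X_train)]
print(len(mini_sets), mini_sets[0][0].shape, mini_sets[0][1].shape)
```

With this layout the test instances never appear in any subset, so accuracy measured on `X_test` is not inflated by overlap with the training data.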


In [70]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
import numpy as np

In [75]:
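# Single-value grid: max_leaf_nodes=30, presumably the best value found in the previous exercise
param_grid = {
    'max_leaf_nodes': [30],
}

sum_accuracy = 0

for X_sub, y_sub in subsets:
    clf = DecisionTreeClassifier(random_state=42)
    # With only one candidate, GridSearchCV just cross-validates it and refits on the whole subset
    grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_sub, y_sub)
    best_model = grid_search.best_estimator_
    # Score on another randomly chosen 100-instance subset (not a held-out test set);
    # np.random is unseeded here, so the printed values change from run to run
    ran = np.random.randint(0, 1000)
    test_acc = accuracy_score(subsets[ran][1], best_model.predict(subsets[ran][0]))
    sum_accuracy += test_acc
    print(test_acc)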


0.98
0.99
0.99
0.96
0.99
0.97
0.92
0.97
0.91
0.94
0.97
0.88
0.99
0.99
0.94
0.91
0.95
0.99
0.92
0.95
0.96
0.89
0.98
0.97
0.97
0.96
0.92
1.0
0.98
0.92
0.97
1.0
0.9
0.93
0.97
0.97
0.95
1.0
0.94
1.0
0.94
0.92
0.97
0.96
1.0
0.99
0.94
0.94
0.97
0.97
0.9
0.98
0.98
0.89
0.89
0.98
0.96
0.95
0.97
1.0
0.96
1.0
0.88
0.88
0.9
0.99
0.97
0.96
0.93
1.0
0.95
0.98
1.0
0.95
0.95
0.97
0.89
0.9
0.98
0.96
0.91
0.93
0.96
0.94
0.99
0.99
0.94
0.91
0.96
0.89
0.91
0.93
0.97
0.99
0.98
0.97
0.9
0.98
0.9
0.96
0.99
0.96
0.93
0.97
0.96
0.94
0.98
0.9
0.94
0.91
1.0
1.0
0.93
0.91
0.92
0.98
0.96
0.96
0.92
0.9
0.85
0.98
0.89
0.95
0.99
0.99
0.94
0.97
0.95
0.93
0.99
0.97
0.99
0.98
0.98
0.98
0.94
0.95
0.95
0.98
0.96
0.89
0.97
0.98
0.93
0.94
0.99
0.95
0.96
0.97
0.95
0.92
0.97
0.97
0.95
0.95
0.98
0.96
0.98
0.89
0.96
0.91
0.89
0.94
0.94
0.93
0.91
0.98
0.98
0.91
0.99
0.98
0.89
0.94
0.96
0.98
0.95
0.97
0.97
0.96
0.98
0.88
0.9
0.96
0.97
0.98
0.99
0.93
1.0
0.96
0.91
0.87
0.97
0.92
0.98
0.93
0.95
0.93
0.91
0.93
0.99
0.94
0.92
0.93
...

In [77]:
# Mean accuracy of the 1,000 individual trees
print(sum_accuracy/1000)

0.9510200000000056
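
The cells above stop after scoring the individual trees, so the last two steps of the exercise (majority vote and its evaluation) are still open. One possible way to finish is sketched below. It assumes an 80/20 train/test split, reuses `max_leaf_nodes=30` from the cell above, and fits each tree directly instead of going through `GridSearchCV`. The variable names are illustrative, and the numbers it prints will differ from the outputs shown above.

```python
import numpy as np
from scipy.stats import mode
from sklearn.base import clone
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=10000, noise=0.1, random_state=42)
# Assumption: 20% of the data is held out as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

n_trees, n_instances = 1000, 100
rs = ShuffleSplit(
    n_splits=n_trees,
    test_size=len(X_train) - n_instances,
    random_state=42,
)

base_tree = DecisionTreeClassifier(max_leaf_nodes=30, random_state=42)

# Train one tree per 100-instance subset and store its test-set predictions.
Y_pred = np.empty((n_trees, len(X_test)), dtype=y.dtype)
for i, (subset_idx, _) in enumerate(rs.split(X_train)):
    tree = clone(base_tree).fit(X_train[subset_idx], y_train[subset_idx])
    Y_pred[i] = tree.predict(X_test)

# Accuracy of each individual tree on the shared test set.
tree_scores = [accuracy_score(y_test, row) for row in Y_pred]

# Majority vote: the most frequent prediction per test instance.
y_vote = np.ravel(mode(Y_pred, axis=0).mode)
forest_acc = accuracy_score(y_test, y_vote)

print("mean single-tree accuracy:", np.mean(tree_scores))
print("majority-vote accuracy:", forest_acc)
```

Comparing the two printed values shows whether the vote improves on the average single tree, which is the effect the exercise is meant to demonstrate.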
