# Bagging and random forest from scratch

## Bagging for classification
First, we can use the make_classification() function to create a synthetic binary classification
problem with 1,000 examples and 20 input features. The complete example is listed below.

In [1]:
# synthetic binary classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5,
random_state=5)
# summarize the dataset
print(X.shape, y.shape)

(1000, 20) (1000,)


We can use the Bagging model as a final model and make predictions for classification.
First, the Bagging ensemble is fit on all available data, then the predict() function can be
called to make predictions on new data. The example below demonstrates this on our binary
classification dataset.

In [2]:
# make predictions using bagging for classification
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5,
random_state=5)
# define the model
model = BaggingClassifier()
# fit the model on the whole dataset
model.fit(X, y)
# make a single prediction
row = [-4.7705504, -1.88685058, -0.96057964, 2.53850317, -6.5843005, 3.45711663,
-7.46225013, 2.01338213, -0.45086384, -1.89314931, -2.90675203, -0.21214568,
-0.9623956, 3.93862591, 0.06276375, 0.33964269, 4.0835676, 1.31423977, -2.17983117,
3.1047287]
yhat = model.predict([row])
# summarize the prediction
print('Predicted Class: %d' % yhat[0])

Predicted Class: 1
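
To make the idea behind the ensemble concrete, the sketch below builds a small bagging classifier by hand: each tree is fit on a bootstrap sample (rows drawn with replacement) and the final class is a majority vote. The helper names `fit_bagging` and `predict_bagging`, the 11 trees, and the seed are illustrative assumptions, not part of scikit-learn. This is a simplified picture rather than the library's implementation; for example, BaggingClassifier averages predicted probabilities when the base estimator provides them instead of counting hard votes.

```python
# from-scratch bagging for binary classification (illustrative sketch)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_bagging(X, y, n_estimators=11, seed=1):
    # n_estimators and seed are arbitrary choices; an odd count avoids tied votes
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(n_estimators):
        # bootstrap sample: draw n row indices with replacement
        idx = rng.integers(0, n, size=n)
        tree = DecisionTreeClassifier(random_state=int(rng.integers(0, 2**31 - 1)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_bagging(trees, X):
    # one row of votes per tree, shape (n_trees, n_examples)
    votes = np.stack([tree.predict(X) for tree in trees])
    # labels are 0/1, so class 1 wins when more than half of the trees vote for it
    return (votes.mean(axis=0) > 0.5).astype(int)

# same synthetic dataset as above
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5,
                           random_state=5)
trees = fit_bagging(X, y)
# compare the ensemble's votes with the true labels for the first five rows
print(predict_bagging(trees, X[:5]), y[:5])
```

Because every tree sees a different resample of the rows, the individual trees disagree on some examples, and the vote smooths out those disagreements.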


## Bagging for regression

First, we can use the make_regression() function to create a synthetic regression problem with
1,000 examples and 20 input features. The complete example is listed below.

In [3]:
# synthetic regression dataset
from sklearn.datasets import make_regression
# define dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1,
random_state=5)
# summarize the dataset
print(X.shape, y.shape)

(1000, 20) (1000,)


We can also use the Bagging model as a final model and make predictions for regression.
First, the Bagging ensemble is fit on all available data, then the predict() function can be
called to make predictions on new data. The example below demonstrates this on our regression
dataset.

In [6]:
# bagging ensemble for making predictions for regression
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
# define dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1,
random_state=5)
# define the model
model = BaggingRegressor()
# fit the model on the whole dataset
model.fit(X, y)
# make a single prediction
row = [0.88950817, -0.93540416, 0.08392824, 0.26438806, -0.52828711, -1.21102238,
-0.4499934, 1.47392391, -0.19737726, -0.22252503, 0.02307668, 0.26953276, 0.03572757,
-0.51606983, -0.39937452, 1.8121736, -0.00775917, -0.02514283, -0.76089365, 1.58692212]
yhat = model.predict([row])
# summarize the prediction
print('Prediction: %d' % yhat[0])


Prediction: -190
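
For regression, the ensemble combines its members by averaging rather than voting. The sketch below is a hand-rolled illustration of that step, not scikit-learn's implementation: it fits trees on bootstrap samples, averages their predictions on held-out rows, and compares the error with that of a single tree. The 70/30 split, the 25 trees, and the seeds are assumptions chosen for the example.

```python
# from-scratch bagging for regression: average the trees (illustrative sketch)
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# same synthetic dataset as above, with a hold-out split added for evaluation
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1,
                       random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

rng = np.random.default_rng(1)
n = X_train.shape[0]
test_preds = []
for i in range(25):
    # bootstrap sample: draw n training rows with replacement
    idx = rng.integers(0, n, size=n)
    tree = DecisionTreeRegressor(random_state=i).fit(X_train[idx], y_train[idx])
    test_preds.append(tree.predict(X_test))

# the ensemble prediction is the mean of the individual tree predictions
ensemble_pred = np.mean(test_preds, axis=0)

# mean squared error of the first tree alone versus the averaged ensemble
single_mse = np.mean((test_preds[0] - y_test) ** 2)
ensemble_mse = np.mean((ensemble_pred - y_test) ** 2)
print('Single tree MSE: %.1f' % single_mse)
print('Bagged trees MSE: %.1f' % ensemble_mse)
```

Averaging many high-variance trees is what gives bagging its benefit, so the ensemble error should come out well below the single-tree error here.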


## Interpretation

A high information gain suggests that splitting the dataset based on the chosen feature effectively reduces uncertainty about the class labels. Features with higher information gain are considered more informative and are preferred in decision tree algorithms and other machine learning models.

## Application

Information gain is widely used in decision tree algorithms, such as ID3 (Iterative Dichotomiser 3) and C4.5, to select the best features for splitting nodes in the tree. By recursively choosing features with the highest information gain, decision trees can efficiently partition the feature space and make accurate predictions.

In summary, information gain plays a crucial role in feature selection and decision making in machine learning by quantifying the usefulness of features in reducing uncertainty about class labels.
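
For binary labels, the quantities discussed above can be written compactly. The notation is chosen here for illustration: $p$ is the fraction of class-1 labels in a node, $S$ is the set of labels at the parent node, and $S_L$, $S_R$ are the labels sent to the left and right child.

```latex
H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p), \qquad H(0) = H(1) = 0

IG = H(p_{\text{parent}}) - \frac{|S_L|}{|S|} H(p_L) - \frac{|S_R|}{|S|} H(p_R)
```

As a quick check, splitting the labels into a left child [1, 1, 1, 1] and a right child [0, 0, 0, 0] gives a parent with $p = 0.5$ and $H = 1$ bit, while both children are pure with $H = 0$, so $IG = 1$.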

In [None]:
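import numpy as np

def entropy(p):
    # binary entropy (in bits) of a node whose fraction of class-1 labels is p
    if p == 0 or p == 1:
        # a pure node has no uncertainty; this also avoids evaluating log2(0)
        return 0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))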


In [None]:
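def information_gain(left_child, right_child):
    # the parent node holds all labels from both children
    parent = ______ + ______
    # with no samples at all there is nothing to split, so define the gain as 0
    # (this also avoids dividing by len(parent) == 0 below)
    if len(parent) == 0:
        return 0.0
    # fraction of class-1 labels in each node (0 for an empty child)
    p_parent = parent.count(1) / len(parent)
    p_left = left_child.count(1) / len(left_child) if len(left_child) > 0 else 0
    p_right = right_child.count(1) / len(right_child) if len(right_child) > 0 else 0
    # entropy of the parent and of each child
    H_parent = ______(p_parent)
    H_left = ______(p_left)
    H_right = ______(p_right)
    # parent entropy minus the size-weighted entropy of the children
    return H_parent - len(left_child) / len(parent) * H_left - len(right_child) / len(parent) * H_right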

<h5><font color=blue>Check result by executing below... 📝</font></h5>

In [None]:
%%ipytest -qq
import pytest

def test_information_gain_basic():
    # basic case: both children contain data
    left_child = [1, 0, 1, 0, 1]  # example left child
    right_child = [0, 1, 0, 1, 0]  # example right child
    # the parent entropy is 1 bit and each child has about 0.971 bits,
    # so the gain is small but not exactly zero
    assert information_gain(left_child, right_child) == pytest.approx(0.029, abs=1e-3)

def test_information_gain_empty_left():
    # the left child is empty
    left_child = []  # empty left child
    right_child = [0, 1, 0, 1, 0]  # example right child
    assert information_gain(left_child, right_child) == pytest.approx(0.0)

def test_information_gain_empty_right():
    # the right child is empty
    left_child = [1, 0, 1, 0, 1]  # example left child
    right_child = []  # empty right child
    assert information_gain(left_child, right_child) == pytest.approx(0.0)

def test_information_gain_both_empty():
    # both children are empty
    left_child = []  # empty left child
    right_child = []  # empty right child
    assert information_gain(left_child, right_child) == pytest.approx(0.0)

def test_information_gain_extreme_values():
    # edge cases: p equals 0 or 1
    left_child = [1, 1, 1, 1]  # left child is all 1s
    right_child = [0, 0, 0, 0]  # right child is all 0s
    assert information_gain(left_child, right_child) == pytest.approx(1.0)

    left_child = [0, 0, 0, 0]  # left child is all 0s
    right_child = [1, 1, 1, 1]  # right child is all 1s
    assert information_gain(left_child, right_child) == pytest.approx(1.0)

    left_child = [1, 1, 1, 1]  # left child is all 1s
    right_child = []  # empty right child
    assert information_gain(left_child, right_child) == pytest.approx(0.0)

    left_child = []  # empty left child
    right_child = [1, 1, 1, 1]  # right child is all 1s
    assert information_gain(left_child, right_child) == pytest.approx(0.0)

<div class="alert alert-info">
    
<details><summary>👩‍💻 <b>Hint</b></summary>

Consider filling the first two blanks with <code>left_child</code> and <code>right_child</code>.

</details>

</div>