## Gini Impurity

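Gini impurity measures how mixed the class labels are in a decision-tree node. It is the probability that a randomly chosen observation from the node would be mislabeled if it were labeled at random according to the node's class distribution.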

Decision-tree learners use Gini impurity to choose splits, and it is the default criterion in [`sklearn.tree.DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

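$$\text{Gini Impurity} = \sum\limits_{k} p_{k} (1 - p_{k}) = \sum\limits_{k} \left(p_{k} - p_{k}^2\right) = \sum\limits_{k} p_{k} - \sum\limits_{k} p_{k}^2 = 1 - \sum\limits_{k} p_{k}^2$$
where $p_{k}$ is the proportion of observations in the node that belong to class $k$ of the target. The last step uses $\sum_{k} p_{k} = 1$.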

For a binary target, Gini impurity ranges from 0 to 0.5. If every observation in the node is from one class, the Gini impurity is 1 - 1^2 = 0. If half of the observations are from one class and the other half are from the other class, the Gini impurity is 1 - (0.5^2 + 0.5^2) = 0.5. More generally, with K classes the maximum is 1 - 1/K, reached when all classes are equally frequent.
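The 0.5 upper bound is specific to two classes. As a minimal sketch (standard library only; the helper `gini_impurity` below is re-defined here for illustration), the following compares the impurity of perfectly balanced nodes with K = 2, 3, 4 classes against 1 - 1/K:

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a sequence of class labels."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Perfectly balanced nodes: five observations for each of K classes.
for k in (2, 3, 4):
    balanced = list(range(k)) * 5
    print(f"K={k}: gini={gini_impurity(balanced):.4f}, 1 - 1/K={1 - 1 / k:.4f}")
```

The two printed columns should agree for every K, which shows that the maximum grows toward 1 as the number of classes increases.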

The other common criterion for decision trees is log loss, also called entropy, given by:

$$\text{Log Loss or Entropy} = -\sum\limits_{k} p_{k} \log(p_{k})$$

Both criteria measure the impurity of a node and take higher values as the node becomes more impure. When a node is pure, both criteria equal 0. You can read more about log loss in [this notebook](binary_cross_entropy.ipynb).
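To see the two measures side by side, here is a small sketch using only the standard library. The helper names (`class_proportions`, `gini_impurity`, `entropy`) are illustrative, and the base-2 logarithm is an assumption, since the formula above does not fix a base:

```python
import math
from collections import Counter


def class_proportions(labels):
    """Proportion p_k of each class that is present in the node."""
    n = len(labels)
    return [c / n for c in Counter(labels).values()]


def gini_impurity(labels):
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))


def entropy(labels):
    # Assumption: base-2 logarithm. p * log2(1 / p) equals -p * log2(p);
    # only classes that are present are counted, so p is never 0.
    return sum(p * math.log2(1 / p) for p in class_proportions(labels))


for node in ([0, 0, 0, 0], [1, 1, 1, 1, 0], [0, 0, 1, 1]):
    print(f"{node}: gini={gini_impurity(node):.4f}, entropy={entropy(node):.4f}")
```

Both values are 0 for the pure node and increase together as the classes become more evenly mixed, although they are on different scales.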

In [1]:
import numpy as np

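def gini_impurity(y):
    """Return the Gini impurity of an array-like of class labels."""
    # Count how many observations fall in each class.
    _, counts = np.unique(y, return_counts=True)
    # Convert the counts to class proportions p_k.
    p = counts / len(y)
    # Gini impurity: 1 - sum_k p_k^2.
    return 1 - np.sum(p ** 2)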

# Example usage:
example = [1, 1, 1, 1, 1]
print(f"Gini Impurity {example}: {gini_impurity(example)}")

example = [0, 0, 0, 0, 0]
print(f"Gini Impurity {example}: {gini_impurity(example)}")

example = [0, 0, 1, 1, 0]
print(f"Gini Impurity {example}: {gini_impurity(example)}")

example = [1, 1, 1, 1, 0]
print(f"Gini Impurity {example}: {gini_impurity(example):.2f}")

example = [0, 0, 1, 1]
print(f"Gini Impurity {example}: {gini_impurity(example)}")

Gini Impurity [1, 1, 1, 1, 1]: 0.0
Gini Impurity [0, 0, 0, 0, 0]: 0.0
Gini Impurity [0, 0, 1, 1, 0]: 0.48
Gini Impurity [1, 1, 1, 1, 0]: 0.32
Gini Impurity [0, 0, 1, 1]: 0.5
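
When growing a tree, a candidate split is typically scored by the size-weighted average impurity of the child nodes it produces, and lower is better. The sketch below illustrates that idea with the standard library only; `weighted_gini` and the two candidate splits are hypothetical and not taken from scikit-learn's implementation:

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a sequence of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(left, right):
    """Size-weighted average Gini impurity of two child nodes."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


parent = [0, 0, 0, 1, 1, 1]
# Two hypothetical ways of splitting the parent node into two children.
clean_split = ([0, 0, 0], [1, 1, 1])
poor_split = ([0, 0, 1], [0, 1, 1])

print(f"parent:      {gini_impurity(parent):.4f}")
print(f"clean split: {weighted_gini(*clean_split):.4f}")
print(f"poor split:  {weighted_gini(*poor_split):.4f}")
```

The split that separates the classes completely has a weighted impurity of 0, so it would be preferred over the split that leaves both children mixed.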
