# Gini Index (Goldman Sachs)
##### *AI/ML Domain*

Explain the concept of the Gini Index. How is the Gini Index calculated?

### Solution
The Gini Index, or Gini impurity, measures the probability that a randomly chosen observation would be misclassified if it were labeled at random according to the class distribution of its node. Decision tree algorithms use it to decide which feature to split on at each node. It can be expressed mathematically as:

$$Gini=1-\sum_{i=1}^n(p_i)^2$$

where $p_i$ is the proportion of observations in the node that belong to class $i$, and $n$ is the number of classes.

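The Gini Index is calculated by subtracting the sum of the squared class probabilities from one. A value of 0 means the node is pure (all observations belong to one class), and larger values mean the classes are more mixed. It tends to favor larger partitions. Suppose you want to build a decision tree and need to decide which feature (column) to use for the first split. You compute the weighted Gini Index of each candidate split and choose the feature with the lowest value.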
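
As a quick worked instance of the formula, consider a hypothetical node (illustrative numbers only) with two classes, where 70% of the observations belong to the first class and 30% to the second:

$$Gini=1-(0.7^2+0.3^2)=1-(0.49+0.09)=0.42$$

A pure node gives $1-1^2=0$, and an evenly mixed two-class node gives $1-(0.5^2+0.5^2)=0.5$, which is the maximum for two classes.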

The pseudo-code below illustrates how to calculate the Gini Index of a split:
```
gini_index(split):
    for each branch in split:
        Calculate the fraction of all observations that fall in the branch. (used for weighting)

        for each class in branch:
            Calculate the probability of the class in the given branch.
            Square the class probability.

        Sum the squared class probabilities.
        Subtract the sum from 1. (this is the gini index for the branch)

    Multiply each branch's gini index by the branch's fraction of observations.
    Sum the weighted gini indexes over all branches. (this is the gini index for the split)
```
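
The pseudo-code can be turned into a short runnable sketch. The function names `gini_impurity` and `weighted_gini` and the sample split are hypothetical, chosen only for illustration, and the sketch assumes each branch is given as a list of class labels:

```python
from collections import Counter


def gini_impurity(labels):
    """Gini index of one branch: 1 minus the sum of squared class probabilities."""
    total = len(labels)
    if total == 0:
        return 0.0  # assumption: an empty branch is treated as pure
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


def weighted_gini(branches):
    """Gini index of a split: branch impurities weighted by branch size."""
    total = sum(len(labels) for labels in branches)
    if total == 0:
        return 0.0  # assumption: a split with no observations has no impurity
    return sum(len(labels) / total * gini_impurity(labels) for labels in branches)


# Hypothetical split: one mixed branch of four labels, one pure branch of two.
split = [["a", "a", "b", "b"], ["a", "a"]]
print(round(weighted_gini(split), 2))  # 4/6 * 0.5 + 2/6 * 0.0 = 0.33
```

Weighting by branch size keeps a small, pure branch from hiding a large, mixed one when candidate splits are compared.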

Let's take a look at an example.

<table style="margin-left: auto; margin-right: auto;" border="1">
<tbody>
<tr>
<td style="text-align: center;"><strong>Past Trend</strong></td>
<td style="text-align: center;"><strong>Open Interest</strong></td>
<td style="text-align: center;"><strong>Trading Volume</strong></td>
<td style="text-align: center;"><strong>Return</strong></td>
</tr>
<tr>
<td style="text-align: center;">Positive</td>
<td style="text-align: center;">Low</td>
<td style="text-align: center;">High</td>
<td style="text-align: center;">Up</td>
</tr>
<tr>
<td style="text-align: center;">Negative</td>
<td style="text-align: center;">High</td>
<td style="text-align: center;">Low</td>
<td style="text-align: center;">Down</td>
</tr>
<tr>
<td style="text-align: center;">Positive</td>
<td style="text-align: center;">Low</td>
<td style="text-align: center;">High</td>
<td style="text-align: center;">Up</td>
</tr>
<tr>
<td style="text-align: center;">Positive</td>
<td style="text-align: center;">High</td>
<td style="text-align: center;">High</td>
<td style="text-align: center;">Up</td>
</tr>
<tr>
<td style="text-align: center;">Negative</td>
<td style="text-align: center;">Low</td>
<td style="text-align: center;">High</td>
<td style="text-align: center;">Down</td>
</tr>
<tr>
<td style="text-align: center;">Positive</td>
<td style="text-align: center;">Low</td>
<td style="text-align: center;">Low</td>
<td style="text-align: center;">Down</td>
</tr>
<tr>
<td style="text-align: center;">Negative</td>
<td style="text-align: center;">High</td>
<td style="text-align: center;">High</td>
<td style="text-align: center;">Down</td>
</tr>
<tr>
<td style="text-align: center;">Negative</td>
<td style="text-align: center;">Low</td>
<td style="text-align: center;">High</td>
<td style="text-align: center;">Down</td>
</tr>
<tr>
<td style="text-align: center;">Positive</td>
<td style="text-align: center;">Low</td>
<td style="text-align: center;">Low</td>
<td style="text-align: center;">Down</td>
</tr>
<tr>
<td style="text-align: center;">Positive</td>
<td style="text-align: center;">High</td>
<td style="text-align: center;">High</td>
<td style="text-align: center;">Up</td>
</tr>
</tbody>
</table>

Say we want to build a decision tree to predict whether Return will be Up or Down based on the values of Past Trend, Open Interest, and Trading Volume.

**Resulting Gini Indexes**
<table style="margin-left: auto; margin-right: auto;" border="1">
<tbody>
<tr>
<td style="text-align: center;"><strong>Attributes/Features</strong></td>
<td style="text-align: center;"><strong>Gini Index</strong></td>
</tr>
<tr>
<td style="text-align: center;">Past Trend</td>
<td style="text-align: center;">0.27</td>
</tr>
<tr>
<td style="text-align: center;">Open Interest</td>
<td style="text-align: center;">0.47</td>
</tr>
<tr>
<td style="text-align: center;">Trading Volume</td>
<td style="text-align: center;">0.34</td>
</tr>
</tbody>
</table>
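
As a worked check, each value in this table is the size-weighted sum of the branch Gini Indexes, with the Up/Down counts read off the ten observations. Past Trend splits the data into 6 'Positive' rows (4 Up, 2 Down) and 4 'Negative' rows (0 Up, 4 Down):

$$\text{Gini}(\text{Past Trend})=\frac{6}{10}\left(1-\left(\frac{4}{6}\right)^2-\left(\frac{2}{6}\right)^2\right)+\frac{4}{10}\left(1-\left(\frac{0}{4}\right)^2-\left(\frac{4}{4}\right)^2\right)\approx 0.6(0.444)+0.4(0)\approx 0.27$$

Open Interest gives 6 'Low' rows (2 Up, 4 Down) and 4 'High' rows (2 Up, 2 Down):

$$\text{Gini}(\text{Open Interest})=\frac{6}{10}\left(1-\left(\frac{2}{6}\right)^2-\left(\frac{4}{6}\right)^2\right)+\frac{4}{10}\left(1-\left(\frac{2}{4}\right)^2-\left(\frac{2}{4}\right)^2\right)\approx 0.6(0.444)+0.4(0.5)\approx 0.47$$

Trading Volume gives 7 'High' rows (4 Up, 3 Down) and 3 'Low' rows (0 Up, 3 Down):

$$\text{Gini}(\text{Trading Volume})=\frac{7}{10}\left(1-\left(\frac{4}{7}\right)^2-\left(\frac{3}{7}\right)^2\right)+\frac{3}{10}\left(1-\left(\frac{0}{3}\right)^2-\left(\frac{3}{3}\right)^2\right)\approx 0.7(0.490)+0.3(0)\approx 0.34$$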

From the table above, we observe that Past Trend has the lowest Gini Index, so it is chosen as the root node of the decision tree. The root node has two branches, 'Positive' and 'Negative', corresponding to the values that Past Trend can take. Every observation with a 'Negative' Past Trend has a 'Down' Return, so that branch is already pure and needs no further split. We now calculate Gini Indexes within the 'Positive' branch of Past Trend, starting from the subset of the original table that only includes observations with a 'Positive' Past Trend.

<table style="width: 443px; margin-left: auto; margin-right: auto;">
<tbody>
<tr>
<td style="width: 100.69px; text-align: center;"><strong>Past Trend</strong></td>
<td style="width: 111.745px; text-align: center;"><strong>Open Interest</strong></td>
<td style="width: 138.932px; text-align: center;"><strong>Trading Volume</strong></td>
<td style="width: 65.4818px; text-align: center;"><strong>Return</strong></td>
</tr>
<tr>
<td style="width: 100.69px; text-align: center;">Positive</td>
<td style="width: 111.745px; text-align: center;">Low</td>
<td style="width: 138.932px; text-align: center;">High</td>
<td style="width: 65.4818px; text-align: center;">Up</td>
</tr>
<tr>
<td style="width: 100.69px; text-align: center;">Positive</td>
<td style="width: 111.745px; text-align: center;">Low</td>
<td style="width: 138.932px; text-align: center;">High</td>
<td style="width: 65.4818px; text-align: center;">Up</td>
</tr>
<tr>
<td style="width: 100.69px; text-align: center;">Positive</td>
<td style="width: 111.745px; text-align: center;">High</td>
<td style="width: 138.932px; text-align: center;">High</td>
<td style="width: 65.4818px; text-align: center;">Up</td>
</tr>
<tr>
<td style="width: 100.69px; text-align: center;">Positive</td>
<td style="width: 111.745px; text-align: center;">Low</td>
<td style="width: 138.932px; text-align: center;">Low</td>
<td style="width: 65.4818px; text-align: center;">Down</td>
</tr>
<tr>
<td style="width: 100.69px; text-align: center;">Positive</td>
<td style="width: 111.745px; text-align: center;">Low</td>
<td style="width: 138.932px; text-align: center;">Low</td>
<td style="width: 65.4818px; text-align: center;">Down</td>
</tr>
<tr>
<td style="width: 100.69px; text-align: center;">Positive</td>
<td style="width: 111.745px; text-align: center;">High</td>
<td style="width: 138.932px; text-align: center;">High</td>
<td style="width: 65.4818px; text-align: center;">Up</td>
</tr>
</tbody>
</table>

**Resulting Gini Indexes**
<table style="margin-left: auto; margin-right: auto;" border="1">
<tbody>
<tr>
<td style="text-align: center;"><strong>Attributes/Features</strong></td>
<td style="text-align: center;"><strong>Gini Index</strong></td>
</tr>
<tr>
<td style="text-align: center;">Open Interest</td>
<td style="text-align: center;">0.33</td>
</tr>
<tr>
<td style="text-align: center;">Trading Volume</td>
<td style="text-align: center;">0</td>
</tr>
</tbody>
</table>
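
The Gini Indexes in both result tables can be reproduced with a few lines of code. This is a minimal sketch, not a reference implementation: the names `COLUMNS`, `ROWS`, and `split_gini` are hypothetical, and the data is copied by hand from the first table.

```python
from collections import Counter, defaultdict

# The ten observations from the first table.
COLUMNS = ("Past Trend", "Open Interest", "Trading Volume", "Return")
ROWS = [
    ("Positive", "Low", "High", "Up"),
    ("Negative", "High", "Low", "Down"),
    ("Positive", "Low", "High", "Up"),
    ("Positive", "High", "High", "Up"),
    ("Negative", "Low", "High", "Down"),
    ("Positive", "Low", "Low", "Down"),
    ("Negative", "High", "High", "Down"),
    ("Negative", "Low", "High", "Down"),
    ("Positive", "Low", "Low", "Down"),
    ("Positive", "High", "High", "Up"),
]


def split_gini(rows, feature, target="Return"):
    """Weighted Gini index of splitting `rows` on `feature`."""
    f_idx, t_idx = COLUMNS.index(feature), COLUMNS.index(target)
    # Group the target labels by the value of the split feature.
    branches = defaultdict(list)
    for row in rows:
        branches[row[f_idx]].append(row[t_idx])
    total = len(rows)
    result = 0.0
    for labels in branches.values():
        counts = Counter(labels)
        impurity = 1.0 - sum((c / len(labels)) ** 2 for c in counts.values())
        result += len(labels) / total * impurity  # weight by branch size
    return result


# Full table: candidates for the root node.
for feature in COLUMNS[:3]:
    print(feature, round(split_gini(ROWS, feature), 2))

# 'Positive' Past Trend subset: candidates for the next node.
positive = [row for row in ROWS if row[0] == "Positive"]
for feature in ("Open Interest", "Trading Volume"):
    print(feature, round(split_gini(positive, feature), 2))
```

Rounded to two decimals, this prints 0.27, 0.47, and 0.34 for the full table, then 0.33 and 0.0 for the 'Positive' subset, matching the tables above.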

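The Trading Volume feature has the lowest Gini Index, so we select it as the next node in the decision tree. From this node, there are branches for 'High' and 'Low'. Because the Gini Index of this split is 0, both branches are pure: every 'High' observation in the subset has an 'Up' Return and every 'Low' observation has a 'Down' Return. Both branches therefore end in leaf nodes, and no split on Open Interest (the only unused feature at this point) is needed within the 'Positive' branch.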
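
As a sketch, the tree built so far can be written as nested conditionals and checked against the training data. The function name `predict_return` is hypothetical, and the sketch assumes that every pure branch (Gini Index 0) becomes a leaf that predicts its only class:

```python
def predict_return(past_trend, trading_volume):
    """Hand-written tree: Past Trend at the root, then Trading Volume."""
    if past_trend == "Negative":
        return "Down"  # pure branch: every 'Negative' row has a 'Down' Return
    # 'Positive' branch: split on Trading Volume, both sides are pure.
    return "Up" if trading_volume == "High" else "Down"


# (Past Trend, Open Interest, Trading Volume, Return) rows from the first table.
rows = [
    ("Positive", "Low", "High", "Up"),
    ("Negative", "High", "Low", "Down"),
    ("Positive", "Low", "High", "Up"),
    ("Positive", "High", "High", "Up"),
    ("Negative", "Low", "High", "Down"),
    ("Positive", "Low", "Low", "Down"),
    ("Negative", "High", "High", "Down"),
    ("Negative", "Low", "High", "Down"),
    ("Positive", "Low", "Low", "Down"),
    ("Positive", "High", "High", "Up"),
]

# Open Interest is ignored because the tree never needs to split on it.
correct = sum(predict_return(pt, tv) == ret for pt, _, tv, ret in rows)
print(f"{correct}/{len(rows)} training rows classified correctly")
```

On these ten rows the sketch classifies every observation correctly, which is expected because each leaf is pure. It says nothing about how the tree would perform on unseen data.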