In [1]:
from IPython.display import HTML

# Load the notebook's custom stylesheet and apply it to the page
css_file = './custom.css'
with open(css_file, "r") as f:
    css = f.read()
HTML(css)

# Homogeneity Metrics

© 2018 Daniel Voigt Godoy

## 1. Definitions

### Entropy

***Entropy*** is a measure of ***uncertainty*** associated with a given ***distribution q(y)***.

From Wikipedia:

    ...is the average rate at which information is produced by a stochastic source of data. 
    
    ...when a low-probability event occurs, the event carries more "information" ("surprisal")...    

$$
H(q) = -\sum_{c=1}^{C}{q(y_c) \cdot \log(q(y_c))}
$$

where:
 - ***q*** is the ***distribution*** (as in the distribution of red and green balls)
 - ***y*** are the ***labels*** (the respective colors of each ball)
 - ***C*** is the number of ***classes*** (as in ***red*** and ***green*** - 2 classes)
 - $q(y_c)$ is the ***proportion*** of balls having color ***c***
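
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the code used by the `intuitiveml` package: the function name `entropy` and the `base` parameter are assumptions made here. The formula does not fix the base of the logarithm, so base 2 (entropy measured in bits) is assumed as the default.

```python
import math

def entropy(proportions, base=2):
    """Entropy of a discrete distribution given as a list of class proportions.

    The log base is an assumption: base 2 gives bits, base e gives nats.
    """
    # Classes with q = 0 are skipped, following the convention 0 * log(0) = 0.
    return sum(-q * math.log(q, base) for q in proportions if q > 0)

# 5 red and 5 green balls out of 10: the most uncertain case
print(entropy([0.5, 0.5]))

# All 10 balls have the same color: no uncertainty at all
print(entropy([1.0, 0.0]))
```

Passing `base=math.e` changes only the unit (nats instead of bits), not which distribution has the highest or lowest entropy.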

### Gini Impurity

***Gini Impurity*** is a measure of ***heterogeneity*** associated with a given ***distribution q(y)***.

$$
G(q) = \sum_{c=1}^{C}{q(y_c) \cdot (1 - q(y_c))}
$$

From Wikipedia:

    ...is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset.
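
The Gini impurity formula can be sketched the same way. The function name `gini_impurity` is an assumption for illustration and is not taken from the `intuitiveml` package. The second function uses the equivalent form $1 - \sum_c q(y_c)^2$, which holds whenever the proportions sum to 1.

```python
def gini_impurity(proportions):
    """Gini impurity of a discrete distribution given as class proportions."""
    # Each term is the chance of drawing class c times the chance of
    # labeling it as anything other than c.
    return sum(q * (1 - q) for q in proportions)

def gini_impurity_alt(proportions):
    """Equivalent form, valid when the proportions sum to 1."""
    return 1 - sum(q ** 2 for q in proportions)

# 5 red and 5 green balls out of 10
print(gini_impurity([0.5, 0.5]))

# All 10 balls have the same color
print(gini_impurity([1.0, 0.0]))
```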

In [2]:
from intuitiveml.Information import *

In [3]:
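# Generate 10 labeled data points (the balls and their colors)
X, y = data(10)
# Build the interactive figure and center it in a container for later display
myinfo = plotInfo(X, y)
vb = VBox(build_figure(myinfo), layout={'align_items': 'center'})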





## 2. Experiment

There are 10 balls (data points) of two possible colors (***classes***). Each ball has its own color (***label***), red or green.

The slider control at the bottom lets you change the number of red balls and, consequently, the number of green balls, since the total stays at 10. In other words, you are changing the ***distribution***.

This change affects both the ***entropy*** and the ***Gini impurity***.

Use the slider to play with different configurations and answer the ***questions*** below.
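
The slider sweep can also be reproduced as a plain table. The sketch below is an illustration under stated assumptions, not the widget's own code: the helper names `entropy`, `gini_impurity` and `sweep` are hypothetical, and base 2 is assumed for the logarithm.

```python
import math

def entropy(proportions, base=2):
    # Base 2 is an assumption; terms with q = 0 are skipped (0 * log(0) = 0).
    return sum(-q * math.log(q, base) for q in proportions if q > 0)

def gini_impurity(proportions):
    return sum(q * (1 - q) for q in proportions)

def sweep(total=10):
    """Return (n_red, entropy, gini) for every possible number of red balls."""
    rows = []
    for n_red in range(total + 1):
        # Proportions of red and green balls for this slider position
        q = [n_red / total, (total - n_red) / total]
        rows.append((n_red, entropy(q), gini_impurity(q)))
    return rows

for n_red, h, g in sweep(10):
    print(f"{n_red:2d} red | entropy = {h:.3f} | gini = {g:.3f}")
```

Both columns are symmetric around an even split: swapping the number of red and green balls leaves the two measures unchanged.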

In [4]:
vb


#### Questions:

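1. How do you maximize (minimize) entropy?

Maximum: 5 red balls and 5 green balls (an even split). Minimum: 0 or 10 red balls (all balls the same color).

2. How do you maximize (minimize) Gini impurity?

Maximum: 5 red balls and 5 green balls. Minimum: 0 or 10 red balls.

3. What's the entropy when all balls have the same color?

Zero. This happens with either 0 or 10 red balls, taking $0 \cdot \log(0) = 0$ by convention.

4. What kind of distribution yields the maximum entropy?

The uniform distribution, where every class is equally likely. Here that means 5 red and 5 green balls out of 10.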


5. Using the formula, compute the ***entropy*** if you had 3 red balls
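
A worked version, assuming a base-2 logarithm (the formula does not state the base). With 3 red and 7 green balls out of 10, the proportions are 0.3 and 0.7:

$$
H(q) = -\left(0.3 \cdot \log_2(0.3) + 0.7 \cdot \log_2(0.7)\right) \approx 0.521 + 0.360 \approx 0.881
$$

With the natural logarithm instead, the same distribution gives approximately 0.611.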



6. Using the formula, compute the ***gini impurity*** if you had 7 red balls
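
A worked version: with 7 red and 3 green balls out of 10, the proportions are 0.7 and 0.3, so:

$$
G(q) = 0.7 \cdot (1 - 0.7) + 0.3 \cdot (1 - 0.3) = 0.21 + 0.21 = 0.42
$$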





#### This material is copyright Daniel Voigt Godoy and made available under the Creative Commons Attribution (CC-BY) license ([link](https://creativecommons.org/licenses/by/4.0/)). 

#### Code is also made available under the MIT License ([link](https://opensource.org/licenses/MIT)).

In [5]:
from IPython.display import HTML
HTML('''<script>
  // Tracks whether the input (code) cells are currently visible.
  var code_shown = false;

  function code_toggle() {
    if (code_shown) {
      $('div.input').hide(500);
      $('#toggleButton').val('Show Code');
    } else {
      $('div.input').show(500);
      $('#toggleButton').val('Hide Code');
    }
    code_shown = !code_shown;
  }

  // Hide all code cells once the page has loaded.
  $(document).ready(function () {
    code_shown = false;
    $('div.input').hide();
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>''')