# Binary Data / Measures of Similarity and Dissimilarity

## Binary data

In this section we consider binary data, that is, data whose attributes take only the values True or False (equivalently, 1 or 0).

If we compare two objects $A$ and $B$, each with $n$ binary attributes, every attribute falls into one of only four cases. Counting the attributes in each case gives the _contingency table_ $M$ below:

<table>
  <tr>
    <td colspan=2></td>
    <th colspan=2 style="text-align:center">A</th>
  </tr>
  <tr>
    <td colspan=2></td>
    <td style="text-align:center">True<br></td>
    <td style="text-align:center">False<br></td>
  </tr>
  <tr>
    <td rowspan=2><b>B</b></td>
    <td>True</td>
    <td>M11 = # attributes True<br>in both A and B<br></td>
    <td>M01 = # attributes where <br>A is False and <br>B is True<br></td>
  </tr>
  <tr>
    <td>False</td>
    <td>M10 = # attributes where<br>A is True and<br>B is False<br></td>
    <td>M00 = # attributes False <br>in both A and B<br></td>
  </tr>
</table>
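
As a minimal sketch of how the four cells can be tallied, assume each object is given as an equal-length sequence of 0/1 (or False/True) values. The function name `contingency_counts` and the dictionary keys are illustrative choices, not part of any library:

```python
def contingency_counts(a, b):
    """Tally the four contingency cells for two binary sequences.

    Keys follow the table's convention: the first digit is A's value,
    the second digit is B's value.
    """
    if len(a) != len(b):
        raise ValueError("a and b must have the same number of attributes")

    counts = {"M11": 0, "M10": 0, "M01": 0, "M00": 0}
    for x, y in zip(a, b):
        if x not in (0, 1) or y not in (0, 1):
            raise ValueError("attributes must be 0 or 1")
        # int() maps True/False to 1/0 so booleans work as well
        counts["M{}{}".format(int(x), int(y))] += 1
    return counts


print(contingency_counts([1, 1, 0, 0], [1, 0, 1, 0]))
# {'M11': 1, 'M10': 1, 'M01': 1, 'M00': 1}
```

Every attribute lands in exactly one cell, so the four counts always sum to $n$.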

## Similarity in binary data - Jaccard coefficient


Say we want a single number that expresses how similar two objects $A$ and $B$ are, based on their binary attributes $\{f_1, f_2, \ldots, f_n\}$. A typical measure is the **Jaccard coefficient**.  ${Jaccard}(A, B)$ is the number of attributes that are True in both $A$ and $B$ divided by the number of attributes that are True in at least one of them, so the larger the coefficient, the more similar $A$ and $B$ are:

$${Jaccard}(A, B) = {{M_{11}} \over {M_{10} + M_{01} + M_{11}}}$$ 


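Note also that $0 \leq Jaccard(A,B) \leq 1$, and that $M_{00}$ does not appear in the formula: attributes that are False in both objects count neither for nor against similarity. The coefficient is undefined when $M_{10} + M_{01} + M_{11} = 0$.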
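
An equivalent way to read the formula: if each object is represented by the set of attributes that are True for it, then $M_{11}$ is the size of the intersection of the two sets and $M_{10} + M_{01} + M_{11}$ is the size of their union. A small sketch under that assumption (the helper name `jaccard_sets` is illustrative):

```python
def jaccard_sets(true_a, true_b):
    """Jaccard coefficient from the sets of attributes that are True in each object."""
    true_a, true_b = set(true_a), set(true_b)
    union = true_a | true_b
    if not union:
        # Neither object has a True attribute: the ratio is 0/0, so it is undefined.
        raise ValueError("Jaccard coefficient is undefined when no attribute is True")
    return len(true_a & true_b) / len(union)


# Attributes f1..f4: A is True on f1 and f2, B is True on f2 and f3.
# f4 is False in both objects, so it appears in neither set.
print(jaccard_sets({"f1", "f2"}, {"f2", "f3"}))  # 1 shared attribute / 3 in the union
```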

## Dissimilarity in binary data

Two binary objects are similar when they have many True attributes in common; conversely, they **are dissimilar if they have few things in common**.

Formally, let $\hat{d}$, the _dissimilarity ratio_ (or distance) between $A$ and $B$, be

$$ \hat{d}(A, B) = { {M_{10} + M_{01}} \over {M_{10} + M_{01} +  M_{11}} }$$

where $M_{01}$, $M_{10}$ and $M_{11}$ are defined as above.  Note also, $0 \leq \hat{d}(A, B) \leq 1.$
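
Since ${Jaccard}(A, B)$ and $\hat{d}(A, B)$ share the same denominator, adding them shows that the two measures are complementary whenever that denominator is nonzero:

$$ {Jaccard}(A, B) + \hat{d}(A, B) = { {M_{11} + M_{10} + M_{01}} \over {M_{10} + M_{01} + M_{11}} } = 1 \quad\Longrightarrow\quad \hat{d}(A, B) = 1 - {Jaccard}(A, B) $$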

### Example
Now let's look at a concrete example.  Let $A$ and $B$ each have 6 binary attributes:

| | Red | Small | Round | Heavy | Shiny | Fragrant |
|-|:---:|:-----:|:-----:|:-----:|:-----:|:--------:|
|A| True|False  |True   |False  |True   |False     |
|B| True|True  |False   |False  |False  |False     |

Our contingency table is:

<table>
  <tr>
    <td colspan=2></td>
    <th colspan=2 style="text-align:center">A</th>
  </tr>
  <tr>
    <td colspan=2></td>
    <td style="text-align:center">True<br></td>
    <td style="text-align:center">False<br></td>
  </tr>
  <tr>
    <td rowspan=2><b>B</b></td>
    <td>True</td>
    <td>1<br></td>
    <td>1<br></td>
  </tr>
  <tr>
    <td>False</td>
    <td>2<br></td>
    <td>2<br></td>
  </tr>
</table>

The Jaccard similarity is ${1 \over {2 + 1 + 1}} = {1 \over 4} = 0.25$ and the dissimilarity is ${ {2+1} \over {2+1+1} } = {3 \over 4} = 0.75$.  The low similarity (equivalently, the high dissimilarity) tells us that these two objects have little in common.
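
The counts and both ratios can be recomputed directly from the attribute table and the two formulas. This is only a sketch; the variable names are illustrative, and True/False are written as Python booleans:

```python
# Attribute order: Red, Small, Round, Heavy, Shiny, Fragrant
A = [True, False, True, False, True, False]
B = [True, True, False, False, False, False]

pairs = list(zip(A, B))
m11 = sum(1 for a, b in pairs if a and b)          # True in both
m10 = sum(1 for a, b in pairs if a and not b)      # A True, B False
m01 = sum(1 for a, b in pairs if not a and b)      # A False, B True
m00 = sum(1 for a, b in pairs if not a and not b)  # False in both

similarity = m11 / (m10 + m01 + m11)
dissimilarity = (m10 + m01) / (m10 + m01 + m11)

print(m11, m10, m01, m00)         # 1 2 1 2
print(similarity, dissimilarity)  # 0.25 0.75
```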

In [1]:
import random

# naive implementation of Jaccard
def jaccard(V_1, V_2):
    # tally the four contingency cells:
    # v_ab counts positions where V_1 is a and V_2 is b
    v_01, v_10, v_11, v_00 = 0, 0, 0, 0

    for v1, v2 in zip(V_1, V_2):
        # check V_1 V_2 binary: reject the pair if either value is not 0 or 1
        if v1 not in [0, 1] or v2 not in [0, 1]:
            raise ValueError("attributes must be 0 or 1")
        if v1 == 0:
            if v2 == 0: v_00 += 1
            if v2 == 1: v_01 += 1
        if v1 == 1:
            if v2 == 0: v_10 += 1
            if v2 == 1: v_11 += 1
    # v_00 is tallied for completeness but does not enter the ratio;
    # raises ZeroDivisionError if both vectors are all zeros (coefficient undefined)
    return v_11 / (v_01 + v_10 + v_11)


for i in range(100):
    v1 = [random.randint(0, 1) for _ in range(10)]
    v2 = [random.randint(0, 1) for _ in range(10)]

    # for fun: just print those random vectors that are really similar
    j_sim = jaccard(v1, v2)
    if j_sim > 0.70:
        print("*{}\n{}\n{}\n--".format(j_sim, v1, v2))

*1.0
[1, 1, 1, 0, 0, 1, 1, 0, 1, 0]
[1, 1, 1, 0, 0, 1, 1, 0, 1, 0]
--
*0.833333333333
[0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
[0, 1, 1, 1, 0, 1, 0, 1, 1, 0]
--
*0.75
[0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
[0, 0, 0, 1, 0, 1, 0, 1, 1, 0]
--
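
The cell above only computes similarity. A companion for the dissimilarity ratio $\hat{d}$ could look like the following sketch, which assumes 0/1 inputs of equal length. The name `jaccard_dissimilarity` is hypothetical and not part of the original code:

```python
def jaccard_dissimilarity(v_1, v_2):
    """Dissimilarity ratio (M10 + M01) / (M10 + M01 + M11) for two 0/1 sequences."""
    if len(v_1) != len(v_2):
        raise ValueError("vectors must have the same length")

    mismatches = 0  # M10 + M01: positions where the two vectors differ
    both_one = 0    # M11: positions where both vectors are 1
    for a, b in zip(v_1, v_2):
        if a not in (0, 1) or b not in (0, 1):
            raise ValueError("vectors must contain only 0 and 1")
        if a != b:
            mismatches += 1
        elif a == 1:
            both_one += 1

    denominator = mismatches + both_one
    if denominator == 0:
        # Both vectors are all zeros, so the ratio is 0/0 and undefined.
        raise ValueError("dissimilarity is undefined when both vectors are all zeros")
    return mismatches / denominator


# Second pair from the output above: one mismatch, five shared ones.
print(jaccard_dissimilarity([0, 0, 1, 1, 0, 1, 0, 1, 1, 0],
                            [0, 1, 1, 1, 0, 1, 0, 1, 1, 0]))
```

Positions where both values are 0 are skipped, mirroring the way $M_{00}$ is left out of the formula.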


### Resources
For further detail, the IBM SPSS Statistics documentation lists a rather large number of additional similarity and dissimilarity measures for binary data [here](http://www.ibm.com/support/knowledgecenter/SSLVMB_21.0.0/com.ibm.spss.statistics.help/cmd_proximities_dissim_measure_binary.htm).
