# 94-775/95-865: Co-Occurrence Analysis for Finding Possible Relationships

Author: George H. Chen (georgechen [at symbol] cmu.edu)

We begin by importing `numpy` and telling it to display numbers to 5 decimal places and to suppress scientific notation when printing tiny numbers:

In [1]:
import numpy as np
# for pretty printing
np.set_printoptions(precision=5, suppress=True)

## Calculating joint and marginal probability tables given a co-occurrence table

We work off the following co-occurrence table:


| <i></i>         | Apple | Facebook | Tesla |
| --------------- |:-----:|:--------:|:-----:|
| Elon Musk       | 10    | 15       | 300   |
| Mark Zuckerberg | 500   | 10000    | 500   |
| Tim Cook        | 200   | 30       | 10    |


So Elon Musk and Apple co-occur in 10 news articles, Elon Musk and Facebook co-occur in 15 news articles, etc.

In [23]:
table = np.array([[10, 15, 300],
                  [500, 10000, 500],
                  [200, 30, 10]])
# later cells refer to this same array as co_occurrence_table
co_occurrence_table = table
print(table)

[[   10    15   300]
 [  500 10000   500]
 [  200    30    10]]


In [24]:
proba = table/table.sum()

In [25]:
proba

array([[0.00086, 0.0013 , 0.02594],
       [0.04323, 0.86468, 0.04323],
       [0.01729, 0.00259, 0.00086]])

In [29]:
PersonP = proba.sum(axis = 1)
PersonP

array([0.0281 , 0.95115, 0.02075])

In [30]:
CorpP = proba.sum(axis = 0)
CorpP

array([0.06139, 0.86857, 0.07004])

In [34]:
OK = np.outer(PersonP,CorpP)
OK

array([[0.00173, 0.02441, 0.00197],
       [0.05839, 0.82614, 0.06662],
       [0.00127, 0.01802, 0.00145]])

In [43]:
PMI = np.log2(proba/OK)
PMI

array([[-0.99657, -4.23412,  3.72022],
       [-0.43363,  0.06578, -0.62373],
       [ 3.76277, -2.79671, -0.74926]])

In [48]:
dictt = {}
person = ["Elon Musk", "Mark Zuckerberg", "Tim Cook"]
corp = ["Apple", "Facebook", "Tesla"]
for cp,p in enumerate(person):
    for cc, c in enumerate(corp):
        print("PMI between", p, " and ", c, "is ", PMI[cp,cc])
        dictt["PMI between " + p + " and " + c] = PMI[cp,cc]
    
        

PMI between Elon Musk  and  Apple is  -0.996565381896946
PMI between Elon Musk  and  Facebook is  -4.2341176104044
PMI between Elon Musk  and  Tesla is  3.7202223303316297
PMI between Mark Zuckerberg  and  Apple is  -0.4336291875057887
PMI between Mark Zuckerberg  and  Facebook is  0.06578417815296361
PMI between Mark Zuckerberg  and  Tesla is  -0.6237320708857317
PMI between Tim Cook  and  Apple is  3.7627680252977145
PMI between Tim Cook  and  Facebook is  -2.796712298097102
PMI between Tim Cook  and  Tesla is  -0.7492629529695907


In [50]:
list(dictt.items())

[('PMI between Elon Musk and Apple', -0.996565381896946),
 ('PMI between Elon Musk and Facebook', -4.2341176104044),
 ('PMI between Elon Musk and Tesla', 3.7202223303316297),
 ('PMI between Mark Zuckerberg and Apple', -0.4336291875057887),
 ('PMI between Mark Zuckerberg and Facebook', 0.06578417815296361),
 ('PMI between Mark Zuckerberg and Tesla', -0.6237320708857317),
 ('PMI between Tim Cook and Apple', 3.7627680252977145),
 ('PMI between Tim Cook and Facebook', -2.796712298097102),
 ('PMI between Tim Cook and Tesla', -0.7492629529695907)]

In [53]:
from operator import itemgetter
sortedList = sorted(list(dictt.items()), reverse = True, key = itemgetter(1))
sortedList[:3]

[('PMI between Tim Cook and Apple', 3.7627680252977145),
 ('PMI between Elon Musk and Tesla', 3.7202223303316297),
 ('PMI between Mark Zuckerberg and Facebook', 0.06578417815296361)]

In [3]:
co_occurrence_table.shape

(3, 3)

In [4]:
num_people, num_companies = co_occurrence_table.shape

In [6]:
co_occurrence_table.sum()

11565

The joint probability table can be obtained by dividing every entry of the co-occurrence table by the total number of co-occurrences:

In [5]:
joint_prob_table = co_occurrence_table / co_occurrence_table.sum()
print(joint_prob_table)

[[0.00086 0.0013  0.02594]
 [0.04323 0.86468 0.04323]
 [0.01729 0.00259 0.00086]]


In [7]:
joint_prob_table.shape

(3, 3)

To get the marginal probabilities **P(Elon Musk)**, **P(Mark Zuckerberg)**, and **P(Tim Cook)**, we sum the joint probability table across columns, which gives one total per row. In `numpy`, axis 0 indexes rows and axis 1 indexes columns, so summing across columns means summing along axis 1:

In [14]:
joint_prob_table.sum(axis=1)

array([0.0281 , 0.95115, 0.02075])

In [8]:
people_prob = joint_prob_table.sum(axis=1)
print(people_prob)

[0.0281  0.95115 0.02075]


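Don't get confused by the shape: `people_prob` is a 1-D array of shape `(3,)`, not a column vector. Wrapping it in another array and transposing converts it to a column of shape `(3, 1)`: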

In [15]:
people_prob.shape

(3,)

In [16]:
type(people_prob)

numpy.ndarray

In [17]:
np.array([people_prob]).T.shape

(3, 1)

In [18]:
type(people_prob)

numpy.ndarray

To get the marginal probabilities **P(Apple)**, **P(Facebook)**, and **P(Tesla)**, we sum across rows of the joint probability table:

In [19]:
company_prob = joint_prob_table.sum(axis=0)
print(company_prob)

[0.06139 0.86857 0.07004]
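
As a quick sanity check on the axis convention, the following minimal sketch recomputes both sets of marginals and confirms that each sums to 1. The names `counts`, `joint`, `row_marginals`, and `col_marginals` are illustrative and are not used elsewhere in this notebook.

```python
import numpy as np

# Same counts as the co-occurrence table above (rows: people, columns: companies).
counts = np.array([[10, 15, 300],
                   [500, 10000, 500],
                   [200, 30, 10]])
joint = counts / counts.sum()

# axis=1 collapses the column axis: one value per row (person).
row_marginals = joint.sum(axis=1)
# axis=0 collapses the row axis: one value per column (company).
col_marginals = joint.sum(axis=0)

# Each set of marginals is a probability distribution, so it should sum to 1.
print(np.isclose(row_marginals.sum(), 1.0), np.isclose(col_marginals.sum(), 1.0))

# keepdims=True keeps the collapsed axis with length 1, giving a (3, 1) column.
print(joint.sum(axis=1, keepdims=True).shape)
```

Passing `keepdims=True` is one way to get a column-shaped result directly, without wrapping and transposing.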


Next, we compute what the joint probability table would be *if people and companies were independent*. We show two different approaches for doing this. The first is a straightforward calculation that uses the formula $P(A, B)=P(A)P(B)$ when $A$ and $B$ are independent:

In [20]:
joint_prob_table_if_people_and_companies_were_indep = np.zeros((num_people, num_companies))
for row_idx in range(num_people):
    for col_idx in range(num_companies):
        joint_prob_table_if_people_and_companies_were_indep[row_idx, col_idx] = people_prob[row_idx] * company_prob[col_idx]
print(joint_prob_table_if_people_and_companies_were_indep)

[[0.00173 0.02441 0.00197]
 [0.05839 0.82614 0.06662]
 [0.00127 0.01802 0.00145]]


The more elegant, slicker approach for those who know linear algebra is to recognize that we just need to take the outer product between the marginal probabilities for people and the marginal probabilities for companies:

In [21]:
joint_prob_table_if_people_and_companies_were_indep = np.outer(people_prob, company_prob)
print(joint_prob_table_if_people_and_companies_were_indep)

[[0.00173 0.02441 0.00197]
 [0.05839 0.82614 0.06662]
 [0.00127 0.01802 0.00145]]
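
The outer product can also be written with broadcasting, which may make it clearer why the result is a table of pairwise products. This is only a sketch; the variable names `indep_broadcast` and `indep_outer` are illustrative.

```python
import numpy as np

counts = np.array([[10, 15, 300],
                   [500, 10000, 500],
                   [200, 30, 10]])
joint = counts / counts.sum()
people_prob = joint.sum(axis=1)
company_prob = joint.sum(axis=0)

# A (3, 1) column times a (1, 3) row broadcasts to a (3, 3) table whose
# (i, j) entry is people_prob[i] * company_prob[j].
indep_broadcast = people_prob[:, np.newaxis] * company_prob[np.newaxis, :]
indep_outer = np.outer(people_prob, company_prob)

print(np.allclose(indep_broadcast, indep_outer))
# The independent table is itself a joint distribution, so it sums to 1.
print(np.isclose(indep_outer.sum(), 1.0))
```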


## Computing pointwise mutual information (PMI)

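Next, we compute the PMI between each person and each company. The formula for PMI is $$PMI(A,B)=\log\frac{P(A,B)}{P(A)P(B)},$$ where the base of the logarithm is not actually important: changing the base only rescales every PMI value by the same positive constant, so the ranking of pairs is unchanged (we will use log base 2).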

In [22]:
PMI = np.log2(joint_prob_table / joint_prob_table_if_people_and_companies_were_indep)
print(PMI)

[[-0.99657 -4.23412  3.72022]
 [-0.43363  0.06578 -0.62373]
 [ 3.76277 -2.79671 -0.74926]]
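
To see where a single entry comes from, here is a small worked check of the Tim Cook / Apple PMI computed directly from the counts, along with a check that switching the logarithm base only rescales the value. The variable names are illustrative.

```python
import math

n_total = 11565               # sum of all entries in the co-occurrence table
n_cook_apple = 200            # Tim Cook and Apple co-occurrences
n_cook = 200 + 30 + 10        # row total for Tim Cook
n_apple = 10 + 500 + 200      # column total for Apple

p_joint = n_cook_apple / n_total
p_cook = n_cook / n_total
p_apple = n_apple / n_total

pmi_bits = math.log2(p_joint / (p_cook * p_apple))   # log base 2
pmi_nats = math.log(p_joint / (p_cook * p_apple))    # natural log

print(round(pmi_bits, 5))  # 3.76277, matching the bottom-left entry above
# Dividing the natural-log PMI by ln(2) recovers the base-2 PMI.
print(math.isclose(pmi_nats / math.log(2), pmi_bits))
```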


Self practice: given a list of strings for the rows and a list of strings for the columns, print a list of (row, column) pairs ranked by PMI.

By ranking the 9 PMIs from largest to smallest, and looking at the largest 3 PMIs (3.76277, 3.72022, and 0.06578), we see that these tell us the CEO/company pairings.
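
One possible way to do the ranking, sketched here with `np.argsort` and `np.unravel_index` rather than a dictionary; the names `people`, `companies`, and `ranked` are illustrative.

```python
import numpy as np

people = ["Elon Musk", "Mark Zuckerberg", "Tim Cook"]
companies = ["Apple", "Facebook", "Tesla"]
counts = np.array([[10, 15, 300],
                   [500, 10000, 500],
                   [200, 30, 10]])

joint = counts / counts.sum()
indep = np.outer(joint.sum(axis=1), joint.sum(axis=0))
pmi = np.log2(joint / indep)

# Sort the flattened PMI values in descending order, then map each flat
# index back to its (row, column) position in the table.
order = np.argsort(pmi, axis=None)[::-1]
rows, cols = np.unravel_index(order, pmi.shape)
ranked = [(people[r], companies[c], float(pmi[r, c])) for r, c in zip(rows, cols)]

for person, company, score in ranked[:3]:
    print(f"{person} / {company}: {score:.5f}")
```

Note that a zero count would give a PMI of negative infinity under this formula; the table here has no zeros, so the sketch does not handle that case.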

## Computing phi-square and chi-square metrics that tell us how far people and companies are from being independent

PMI compares $P(A,B)$ and $P(A)P(B)$ by looking at the log of their ratio. Phi-square instead looks at
$$\sum_{A,B} \frac{(P(A,B)-P(A)P(B))^2}{P(A)P(B)}.$$

If $N$ is the total number of co-occurrences in the original co-occurrence table, then chi-square is $N$ multiplied by the phi-square value.

In [None]:
numer = (joint_prob_table - joint_prob_table_if_people_and_companies_were_indep)**2

In [None]:
print(numer)

In [None]:
denom = joint_prob_table_if_people_and_companies_were_indep

In [None]:
numer / denom

In [None]:
phi_square = (numer / denom).sum()
print(phi_square)

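A phi-square value of 0 would mean that people and companies are exactly independent; the larger the value, the farther the table is from independence.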

In [None]:
N = co_occurrence_table.sum()
print(N)

In [None]:
chi_square = N * phi_square
print(chi_square)
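
As a cross-check, chi-square can also be computed in its classical form from observed and expected counts, where the expected count for a cell under independence is (row total × column total) / N. The sketch below, with illustrative variable names, confirms that this agrees with N times phi-square.

```python
import numpy as np

counts = np.array([[10, 15, 300],
                   [500, 10000, 500],
                   [200, 30, 10]])
N = counts.sum()

# Probability form, as in the cells above.
joint = counts / N
indep = np.outer(joint.sum(axis=1), joint.sum(axis=0))
phi_square = ((joint - indep) ** 2 / indep).sum()
chi_square = N * phi_square

# Count form: sum over cells of (observed - expected)^2 / expected.
expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / N
chi_square_counts = ((counts - expected) ** 2 / expected).sum()

print(np.isclose(chi_square, chi_square_counts))
```

The two forms agree because each observed count is N times a joint probability and each expected count is N times the corresponding product of marginals.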