# 95-865: Co-occurrence Analysis Toy Example

Author: George H. Chen (georgechen [at symbol] cmu.edu)

For this demo to work, please be sure to download this pickle file [[click here](https://www.andrew.cmu.edu/user/georgech/95-865/co_occurrence_demo_docs.pickle)] and save it to the same directory as this Jupyter notebook.

We will only keep track of a few people and a few companies:

In [1]:
people = ['Elon Musk', 'Sundar Pichai', 'Lisa Su']
companies = ['Alphabet', 'AMD', 'Tesla']

We load in some preprocessed text documents.

In [2]:
import pickle

with open('co_occurrence_demo_docs.pickle', 'rb') as f:
    docs = pickle.load(f)


In [3]:
type(docs)

list

In [4]:
len(docs)

25000

The variable `docs` is a list of text documents. Each document is represented as a list of the names of people and companies it mentions, and a name can appear more than once. We only keep track of the names in the variables `people` and `companies` above, so a document that mentions none of those people or companies is represented as an empty list. For example, here is document 837:

In [5]:
docs[837]

['Elon Musk',
 'Tesla',
 'Elon Musk',
 'Tesla',
 'Tesla',
 'Elon Musk',
 'Elon Musk',
 'Tesla',
 'Tesla',
 'Elon Musk',
 'Elon Musk',
 'Tesla',
 'Lisa Su',
 'AMD']

In [6]:
docs[0]

[]
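To make the representation concrete, here is a minimal sketch using a made-up document (`toy_doc` and the other variable names are hypothetical and not part of the pickle file). It illustrates that a membership test only asks whether a name is present at all, regardless of how many times it is repeated.

```python
# Hypothetical toy document in the same format as an entry of `docs`
toy_doc = ['Elon Musk', 'Tesla', 'Elon Musk', 'Lisa Su']

# Membership test: is the name mentioned at least once?
mentions_musk = 'Elon Musk' in toy_doc

# Number of times the name is mentioned
mention_count = toy_doc.count('Elon Musk')

# Distinct names mentioned (repeats collapse)
distinct_names = set(toy_doc)

# A document mentioning none of the tracked names is an empty list
empty_doc = []
mentions_tesla_in_empty = 'Tesla' in empty_doc

print(mentions_musk, mention_count, sorted(distinct_names), mentions_tesla_in_empty)
```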

### Computing co-occurrence probabilities, and then sorting them from largest to smallest

In [7]:
all_pairs = [(person, company)
             for person in people
             for company in companies]

In [8]:
all_pairs

[('Elon Musk', 'Alphabet'),
 ('Elon Musk', 'AMD'),
 ('Elon Musk', 'Tesla'),
 ('Sundar Pichai', 'Alphabet'),
 ('Sundar Pichai', 'AMD'),
 ('Sundar Pichai', 'Tesla'),
 ('Lisa Su', 'Alphabet'),
 ('Lisa Su', 'AMD'),
 ('Lisa Su', 'Tesla')]

In [9]:
from collections import Counter
co_occurrence_probabilities = Counter()
for person, company in all_pairs:
    count = 0
    for doc in docs:
        if person in doc and company in doc:
            count += 1
    co_occurrence_probabilities[(person, company)] = count / len(docs)

In [10]:
co_occurrence_probabilities.most_common()

[(('Elon Musk', 'Tesla'), 0.53652),
 (('Elon Musk', 'AMD'), 0.08952),
 (('Elon Musk', 'Alphabet'), 0.0824),
 (('Sundar Pichai', 'Alphabet'), 0.04076),
 (('Lisa Su', 'AMD'), 0.02876),
 (('Sundar Pichai', 'Tesla'), 0.02608),
 (('Lisa Su', 'Tesla'), 0.01704),
 (('Sundar Pichai', 'AMD'), 0.0066),
 (('Lisa Su', 'Alphabet'), 0.004)]
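The loop above scans every document once per pair. As a rough alternative sketch (not the notebook's method), the same probabilities can be obtained in a single pass over the documents by converting each document to a set. The toy data and the names `toy_people`, `toy_companies`, `toy_docs`, `pair_counts`, and `toy_co_occurrence` below are made up for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical toy data in the same format as `people`, `companies`, and `docs`
toy_people = ['Elon Musk', 'Lisa Su']
toy_companies = ['AMD', 'Tesla']
toy_docs = [
    ['Elon Musk', 'Tesla', 'Tesla'],
    ['Lisa Su', 'AMD'],
    ['Elon Musk', 'Tesla', 'Lisa Su', 'AMD'],
    [],
]

pair_counts = Counter()
for doc in toy_docs:
    names = set(doc)  # repeats within a document do not matter
    present_people = [p for p in toy_people if p in names]
    present_companies = [c for c in toy_companies if c in names]
    # Count each (person, company) pair at most once per document
    pair_counts.update(product(present_people, present_companies))

# Fraction of documents in which each pair co-occurs
toy_co_occurrence = {pair: n / len(toy_docs) for pair, n in pair_counts.items()}
print(toy_co_occurrence)
```

One difference from the loop above: pairs that never co-occur are simply absent from `toy_co_occurrence`, whereas the original loop stores an explicit 0.0 for them.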

Are (Elon Musk, Tesla), (Elon Musk, AMD), and (Elon Musk, Alphabet) truly the three most interesting person-company pairs? Perhaps ranking by co-occurrence probability alone isn't the best way to find the most interesting pairs.

In this case, it seems that Elon Musk simply appears in many documents, which raises the co-occurrence probability of every pair that includes him. The next approach provides a principled way of down-weighting specific people or companies that occur too frequently.

### Let's first look at *marginal* probabilities

These are the probabilities of an individual person occurring, or of an individual company occurring.

In [11]:
people_probabilities = Counter()
for person in people:
    count = 0
    for doc in docs:
        if person in doc:
            count += 1
    people_probabilities[person] = count / len(docs)
print(people_probabilities)

Counter({'Elon Musk': 0.596, 'Sundar Pichai': 0.04564, 'Lisa Su': 0.03048})


In [12]:
company_probabilities = Counter()
for company in companies:
    count = 0
    for doc in docs:
        if company in doc:
            count += 1
    company_probabilities[company] = count / len(docs)
print(company_probabilities)

Counter({'Tesla': 0.53772, 'AMD': 0.102, 'Alphabet': 0.09868})


### Computing pointwise mutual information (PMI), and then sorting from largest to smallest

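Recall that the PMI of a person $A$ and a company $B$ is defined as:

$$\text{PMI}(A,B) = \log \frac{P(A,B)}{P(A)P(B)}$$

Here $P(A,B)$ is the co-occurrence probability, and $P(A)$ and $P(B)$ are the marginal probabilities.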

In the code below, we use natural log.
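As a worked example, plugging in the values already reported for the pair (Lisa Su, AMD), namely a co-occurrence probability of 0.02876 and marginal probabilities of 0.03048 and 0.102, gives approximately:

$$\log \frac{0.02876}{0.03048 \times 0.102} \approx \log 9.25 \approx 2.22$$

A PMI of 0 would mean the pair co-occurs exactly as often as expected if the two names appeared independently. A positive value means the pair co-occurs more often than that.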

In [13]:
from math import log  # natural log
pmi_scores = Counter()
for person, company in all_pairs:
    ratio = co_occurrence_probabilities[(person, company)] / (people_probabilities[person] * company_probabilities[company])
    pmi_scores[(person, company)] = log(ratio)

In [14]:
pmi_scores.most_common()

[(('Lisa Su', 'AMD'), 2.2246972677322665),
 (('Sundar Pichai', 'Alphabet'), 2.2027896706816303),
 (('Elon Musk', 'Tesla'), 0.515280473364625),
 (('Elon Musk', 'AMD'), 0.38700386263618614),
 (('Sundar Pichai', 'AMD'), 0.34906758973637103),
 (('Elon Musk', 'Alphabet'), 0.33721775717105734),
 (('Lisa Su', 'Alphabet'), 0.28509661762242633),
 (('Sundar Pichai', 'Tesla'), 0.060801512460662566),
 (('Lisa Su', 'Tesla'), 0.03891009097880922)]
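One caveat when applying the PMI code to other data: if a pair never co-occurs, the ratio is 0 and `math.log` raises a `ValueError`. If a marginal probability is 0, the division fails instead. A minimal sketch of one way to guard against this follows. The helper name `safe_pmi` and the choice to return negative infinity are assumptions for illustration, not part of the original demo.

```python
from math import log  # natural log, as in the notebook


def safe_pmi(p_joint, p_a, p_b):
    """Natural-log PMI; returns -inf when the pair never co-occurs.

    Hypothetical helper: the -inf convention is one possible choice.
    """
    if p_a <= 0 or p_b <= 0:
        raise ValueError('marginal probabilities must be positive')
    if p_joint == 0:
        return float('-inf')
    return log(p_joint / (p_a * p_b))


# Values reported above for ('Lisa Su', 'AMD')
lisa_amd_pmi = safe_pmi(0.02876, 0.03048, 0.102)

# A made-up pair that never co-occurs
never_pmi = safe_pmi(0.0, 0.5, 0.5)

print(lisa_amd_pmi, never_pmi)
```

Returning negative infinity keeps such pairs at the bottom of a largest-to-smallest ranking rather than stopping the computation.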