# Topic Contiguity 

Topic contiguity refers to how topics relate to each other in practice, that is, how often they occur together, rather than to how similar they are. From the perspective of an associationist epistemology, contiguity is the opposite of similarity. Contiguous topics are those that co-occur frequently enough to suggest "syndromes." For example, if the topics of genetic engineering and aquaculture are contiguous, this suggests an assemblage of practices in which a kind of knowledge is applied to a specific industry. We use pointwise mutual information (PMI) to surface topic contiguity, similar to how the concept is used in association rule mining, and compare it with the correlation between topics.

# Set Up

## Imports

In [None]:
import pandas as pd
import numpy as np
from lib import tapi

## Configuration

In [None]:
tapi.list_dbs()

In [None]:
data_prefix = 'jstor_hyperparameter_demo'
topic_glosses = ['Bayesian models', 'French', 'MCMC', 'priors', 'economics', 'random effects', 'variable selection',
                 'empirical Bayes', 'env biology', 'genetics']

In [None]:
# data_prefix = 'winereviews'
# topic_glosses = []

In [None]:
# data_prefix = 'tamilnet'
# topic_glosses = []

## Import Topic Data

We import our previously generated model.

In [None]:
db = tapi.Edition(data_prefix)

In [None]:
db.get_tables()

In [None]:
if len(topic_glosses) > 0:
    db.TOPICS_NMF['gloss'] = topic_glosses # THIS SHOULD BE DONE EARLIER IN THE PIPELINE
else:
    # Use the first 20 characters of the top words; slicing also handles strings shorter than 20
    db.TOPICS_NMF['gloss'] = db.TOPICS_NMF.topwords.str[:20]

# Compute Contiguity

## By Correlation

### Create Topic Pairs

In [None]:
topic_pairs = db.THETA_NMF.corr().stack().to_frame('topic_corr')
topic_pairs.index.names = ['t1', 't2']
topic_pairs = topic_pairs.loc[topic_pairs.apply(lambda x: x.name[0] < x.name[1], 1)]

topic_pairs['z_score'] = (topic_pairs.topic_corr - topic_pairs.topic_corr.mean()) / topic_pairs.topic_corr.std()
topic_pairs['t1_gloss'] = topic_pairs.apply(lambda x: db.TOPICS_NMF.loc[x.name[0]].gloss, 1)
topic_pairs['t2_gloss'] = topic_pairs.apply(lambda x: db.TOPICS_NMF.loc[x.name[1]].gloss, 1)
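
The pair-building step above can be illustrated on a toy matrix. This is a minimal sketch, not the notebook's own code: the `theta` values are made up, and `np.triu_indices` is swapped in for the row-wise `apply` filter as one way to keep each unordered topic pair exactly once.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic weights (rows: documents, columns: topics).
theta = pd.DataFrame(
    [[0.9, 0.1, 0.0],
     [0.8, 0.2, 0.1],
     [0.1, 0.7, 0.6],
     [0.0, 0.9, 0.8]],
    columns=[0, 1, 2],
)

corr = theta.corr()

# Keep only the strict upper triangle so each unordered pair appears once.
rows, cols = np.triu_indices(len(corr), k=1)
pairs = pd.DataFrame(
    {"topic_corr": corr.values[rows, cols]},
    index=pd.MultiIndex.from_arrays(
        [corr.index[rows], corr.columns[cols]], names=["t1", "t2"]
    ),
)

# Standardize the correlations across all pairs.
pairs["z_score"] = (pairs.topic_corr - pairs.topic_corr.mean()) / pairs.topic_corr.std()
print(pairs)
```

With three topics this yields the three pairs (0, 1), (0, 2) and (1, 2); topics 1 and 2 rise and fall together in the toy data, so their correlation is positive.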

In [None]:
topic_pairs.sort_values('topic_corr', ascending=False).head(10)

In [None]:
topic_pairs.reset_index().set_index(['t1_gloss','t2_gloss']).topic_corr.sort_values()\
    .plot.barh(figsize=(5, db.n_topics * 2), legend=False);

### View Network

In [None]:
import pydot
from IPython.display import SVG, display

In [None]:
def show_graph(quantile=.5, measure='topic_corr'):
    
    thresh = topic_pairs[measure].quantile(quantile)
    
    graph = pydot.Dot('topic_graph', graph_type='graph')

    nodes = []
    for i in topic_pairs[topic_pairs[measure] >= thresh].index:

        nodes.append(i[0])
        nodes.append(i[1])

        m = topic_pairs.loc[i][measure].round(2)    
        graph.add_edge(pydot.Edge(i[0], i[1], 
                                  label=m, 
                                  color='lightgray', 
                                  fontsize=10, 
                                  fontcolor='green',
                                  fontname='Arial'))

    for node in list(set(nodes)):
        node_gloss = 'T' + str(node) + ": " + db.TOPICS_NMF.loc[node, 'gloss']
        graph.add_node(pydot.Node(node, 
                                  label=node_gloss, 
                                  shape='plain', 
                                  fontname='Arial'))

    display(SVG(graph.create_svg()))

In [None]:
show_graph(.9)

## By Mutual Information

### Compute Marginal Probabilities

In [None]:
# db.TOPICS_NMF['p'] = db.THETA_NMF.sum() / db.THETA_NMF.sum().sum()

In [None]:
db.TOPICS_NMF['p'] = db.PHI_NMF.T.sum() / db.PHI_NMF.T.sum().sum()

### Compute Joint Probabilities

In [None]:
tw_thresh = 0

In [None]:
N = db.THETA_NMF.shape[0]

In [None]:
topic_pairs['p_ab'] = topic_pairs.apply(lambda x: 
                                        db.THETA_NMF[(db.THETA_NMF[x.name[0]] > tw_thresh) 
                                        & (db.THETA_NMF[x.name[1]] > tw_thresh)].shape[0] / N, 1)

In [None]:
topic_pairs.sort_values('p_ab', ascending=False).head(10)
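
The joint probability used here is the share of documents in which both topics have a weight above `tw_thresh`. As a rough sketch under that reading, the same quantity can be computed for all pairs at once by binarizing the matrix and taking a matrix product. The `theta` array below is hypothetical toy data.

```python
import numpy as np

# Hypothetical document-topic weights (rows: documents, columns: topics).
theta = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.0, 0.1],
    [0.0, 0.7, 0.6],
    [0.0, 0.9, 0.8],
])
tw_thresh = 0  # same threshold name as in the notebook

# 1 where a topic is "present" in a document, 0 otherwise.
present = (theta > tw_thresh).astype(int)

# Entry (a, b) counts the documents that contain both topics; dividing by
# the number of documents gives the joint probability estimate.
p_joint = (present.T @ present) / theta.shape[0]
print(p_joint)
```

The diagonal of `p_joint` holds each topic's document frequency, and the off-diagonal entries correspond to `p_ab` for each pair.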

### Compute PWMI


**From Bouma:**

"Pointwise mutual information (PMI, 5) is a measure of how much the actual probability of a particular co-occurrence of events $p(x, y)$ differs from what we would expect it to be on the basis of the probabilities of the individual events and the assumption of independence $p(x)p(y)$."

[Bouma, Gerlof (2009). "Normalized (Pointwise) Mutual Information in Collocation Extraction." _Proceedings of the Biennial GSCL Conference_.](https://svn.spraakdata.gu.se/repos/gerlof/pub/www/Docs/npmi-pfd.pdf)
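
For reference, the quantities described in the quotation can be written out as follows. The first expression is PMI; the second is the normalized variant (NPMI) proposed in the same paper, which divides PMI by the self-information of the joint event. The base of the logarithm cancels in the normalized form.

$$
\mathrm{pmi}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}
$$

$$
\mathrm{npmi}(x, y) = \frac{\mathrm{pmi}(x, y)}{-\log p(x, y)}
$$

When the marginal and joint probabilities are estimated from the same events, NPMI is $0$ under independence, $1$ when the two events only ever occur together, and approaches $-1$ as their co-occurrence vanishes.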

**From Raviv:**

"The pointwise mutual information can be understood as a scaled conditional probability."

"Pointwise mutual information measure is not confined to the $[0,1]$ range. So here we explain how to interpret a zero, a positive or, as it is in our case, a negative number. The case where $PMI=0$ is trivial. It occurs for $log(1) =0$ and it means that $p(x,y) = p(x)p(y)$ which tells us that $x$ and $y$ are independents. If the number is positive it means that the two events co-occuring in a frequency higher than what we would expect if they would be independent event. Why? because $p(y \vert x) \times \frac{1}{p(x)}$ (or equivalently $p(x \vert y) \times \frac{1}{p(y)})$ is larger than $1$ (if it’s smaller than $1$, the log is negative). In our case the number is lower than one, meaning $p(y \vert x) < p(x)$ which means we see more of $X=x$ than we see $y$ given that $X=x$."

https://eranraviv.com/understanding-pointwise-mutual-information-in-statistics/
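
The three cases discussed in the quotation (zero, positive, negative) can be checked numerically. The sketch below uses made-up probabilities and a hypothetical helper named `pmi`; it is only meant to show how the sign of PMI tracks the ratio of $p(x,y)$ to $p(x)p(y)$.

```python
import math

def pmi(p_x, p_y, p_xy):
    """Raw pointwise mutual information in bits (illustrative helper)."""
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical marginals: each event occurs half the time.
p_x, p_y = 0.5, 0.5

independent = pmi(p_x, p_y, 0.25)  # p(x, y) equals p(x)p(y)
attracted = pmi(p_x, p_y, 0.4)     # co-occur more often than chance
repelled = pmi(p_x, p_y, 0.1)      # co-occur less often than chance
print(independent, attracted, repelled)
```

The first value is zero, the second positive and the third negative, matching the interpretation given above.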

In [None]:
import math

In [None]:
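def pwmi(p_a, p_b, p_ab):
    """Computes the normalized point-wise mutual information (NPMI) of two items
    (a and b) that appear in container vectors of some kind, e.g. items in a
    shopping basket. Returns 0 when the value is undefined."""

    # NPMI is undefined when the items never co-occur (p_ab == 0) or always
    # co-occur (p_ab == 1, which would divide by zero below)
    if 0 < p_ab < 1:
        pmi_ab = math.log2(p_ab / (p_a * p_b))  # Raw PMI
        i = math.log2(1/p_ab)                   # Surprise (self-information) of the co-occurrence
        h = p_ab * i                            # Entropy term, only needed for the adjusted variant
#         apmi_ab = pmi_ab / h                    # Adjusted
        npmi_ab = pmi_ab / i                    # Normalized
    else:
#         apmi_ab = 0
        npmi_ab = 0

    return npmi_ab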
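
To see how the normalization behaves at its reference points, here is a small stand-alone sketch. The helper name `npmi` and the probabilities are hypothetical; the zero-co-occurrence case is mapped to 0, mirroring the convention used in the function above, and the always-co-occurring case is treated the same way as an assumption of this sketch.

```python
import math

def npmi(p_a, p_b, p_ab):
    """Normalized PMI; hypothetical stand-alone helper for illustration."""
    if p_ab <= 0 or p_ab >= 1:
        return 0.0  # convention: undefined cases are mapped to 0
    pmi_ab = math.log2(p_ab / (p_a * p_b))
    return pmi_ab / math.log2(1 / p_ab)

always_together = npmi(0.25, 0.25, 0.25)  # a and b only ever appear together
independent = npmi(0.5, 0.5, 0.25)        # p_ab equals p_a * p_b
never_together = npmi(0.5, 0.5, 0.0)      # no co-occurrence at all
print(always_together, independent, never_together)
```

The score is 1 when the two items only occur together and 0 under independence. These bounds hold when the marginals and the joint probability are estimated from the same set of events; if they come from different sources, values outside $[-1, 1]$ are possible.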

In [None]:
topic_pairs['pwmi'] = topic_pairs.apply(lambda x: pwmi(db.TOPICS_NMF.loc[x.name[0]].p, 
                                                       db.TOPICS_NMF.loc[x.name[1]].p,
                                                      x.p_ab), 1)

In [None]:
topic_pairs.sort_values('pwmi', ascending=False).head(10)

In [None]:
show_graph(.75, 'pwmi')

## Compare Contiguity Measures

In [None]:
import plotly.express as px  # plotly_express has been merged into the plotly package

In [None]:
labels = topic_pairs[['t1_gloss','t2_gloss']].apply(lambda x: '<br>'.join(x), 1)

In [None]:
px.scatter(topic_pairs, 'topic_corr', 'pwmi', size='p_ab', text=labels, height=1000, width=1000)

In [None]:
px.scatter_3d(topic_pairs, 'topic_corr', 'pwmi', 'p_ab', text=labels, height=1000, width=1000)