# Co-occurrence Analysis

Cognitive control has many facets, which is why a large part of the literature assesses it with multiple tests in a single study. The co-occurrences of tests in the literature therefore help us identify tests that are commonly used together, tests that are validated against other tests, and tests that could be dropped because they are redundant or uncorrelated with other standard tests. This co-occurrence analysis improves our understanding of coverage and dependencies among cognitive control tests.

This notebook demonstrates analyses of co-occurrences of cognitive tests in PubMed articles.

In [1]:
# Setup

import matplotlib.pyplot as plt
from pathlib import Path
import pandas as pd
import numpy as np
import networkx as nx

files = Path('../data/pubmed/').glob('*.csv')
_corpora = []

for f in files:
    _df = pd.read_csv(f)
    _df['cognitive_test'] = f.stem
    _corpora.append(_df)

CORPUS = pd.concat(_corpora)

# remove irrelevant tasks
CORPUS = CORPUS.query('cognitive_test not in ["STOP","MONITOR"]')

cognitive_tests = CORPUS['cognitive_test'].unique().tolist()

First we need to calculate the co-occurrence of two tests given their two corpora. Each pair of tests gets a raw count and a normalized value, and the normalized value depends on which of the two corpora serves as the reference.

The following code creates an n×n matrix called `cooccurances`, where n is the total number of tests and element `[i,j]` contains the number of articles in corpus `j` that also exist in corpus `i`. Because this is the size of the intersection of the two corpora, `[i,j]` and `[j,i]` hold the same count.

A second matrix, `cooccurances_normalized`, contains the same information, but the value in cell `[i,j]` is divided by the total number of articles in corpus `j`. Read column `j` as the reference corpus: `[i,j]` is the fraction of articles about test `j` that also mention test `i`. Here `[i,j]` and `[j,i]` generally differ because the two corpora have different sizes.
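
To make the indexing concrete, here is a minimal sketch on made-up data. The three test names and the PMID sets are invented for illustration and do not come from the PubMed corpora, and the sketch uses plain Python sets and lists instead of `CORPUS` and NumPy. In this toy data the raw counts are the same in both directions, while the normalized values differ.

```python
# Toy corpora: hypothetical test names mapped to made-up PMIDs.
toy_corpora = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5, 6, 7, 8, 9, 10},
    "C": {4, 11},
}
toy_tests = list(toy_corpora)
n = len(toy_tests)

toy_counts = [[0] * n for _ in range(n)]
toy_normalized = [[0.0] * n for _ in range(n)]

for i, test_i in enumerate(toy_tests):
    for j, test_j in enumerate(toy_tests):
        if i != j:
            # number of articles shared by corpus i and corpus j
            toy_counts[i][j] = len(toy_corpora[test_i] & toy_corpora[test_j])
            # fraction of corpus j (the column) that also appears in corpus i (the row)
            toy_normalized[i][j] = toy_counts[i][j] / len(toy_corpora[test_j])

print(toy_counts[0][1], toy_counts[1][0])          # 2 2
print(toy_normalized[0][1], toy_normalized[1][0])  # 0.25 0.5
```

Read this as: a quarter of the articles in corpus B also appear in corpus A, whereas half of the articles in corpus A also appear in corpus B.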


- [ ] **TODO:** Is there a better way to explain how to interpret the cooccurances matrix? Indexing somehow confuses me.

In [2]:
cooccurances = np.zeros((len(cognitive_tests), len(cognitive_tests)))
cooccurances_normalized = np.zeros_like(cooccurances)

for i, test_i in enumerate(cognitive_tests):
    # PMIDs of all articles in corpus i
    pmids_i = set(CORPUS.query('cognitive_test == @test_i')['pmid'])
    for j, test_j in enumerate(cognitive_tests):
        if i != j:
            # PMIDs of all articles in corpus j
            pmids_j = set(CORPUS.query('cognitive_test == @test_j')['pmid'])
            # number of articles shared by the two corpora
            cooccurances[i, j] = len(pmids_i & pmids_j)
            # fraction of corpus j that also appears in corpus i
            cooccurances_normalized[i, j] = cooccurances[i, j] / len(pmids_j)

# DEBUG cooccurances, cooccurances_normalized

Let's look at the result by finding the most frequently co-occurring cognitive tests:

In [3]:
max_index = np.unravel_index(np.argmax(cooccurances, axis=None), cooccurances.shape)
max_norm_index = np.unravel_index(np.argmax(cooccurances_normalized, axis=None), cooccurances_normalized.shape)

print(f'max co-occurrence: "{cognitive_tests[max_index[0]]}" in "{cognitive_tests[max_index[1]]}" corpus')
print(f'max co-occurrence (normalized by corpus size): "{cognitive_tests[max_norm_index[0]]}" and "{cognitive_tests[max_norm_index[1]]}"')

max co-occurrence: "Stroop" in "TMT (Trail Making Test)" corpus
max co-occurrence (normalized by corpus size): "Sorting Task" and "Spin the Pots"


Let's plot a simple heatmap in which each cell shows the co-occurrence of two tests.

In [4]:
import seaborn as sns

_, ax = plt.subplots(1,1,figsize=(25,25))
sns.heatmap(cooccurances,
            xticklabels=cognitive_tests,
            yticklabels=cognitive_tests,
            cbar_kws = {'use_gridspec':False, 'location':'top', 'shrink': .5},
            square=True,
            linewidths=.5,
            linecolor='k',
            ax=ax)

plt.xticks(rotation=45, ha='right') 
ax.set(title='Bidirectional co-occurrence of cognitive tests')
plt.savefig('../outputs/cooccurances/heatmap.png')
plt.close()

# Now, the same plot for the normalized co-occurrence matrix

_, ax = plt.subplots(1,1,figsize=(25,25))
sns.heatmap(cooccurances_normalized,
            xticklabels=cognitive_tests,
            yticklabels=cognitive_tests,
            cbar_kws = {'use_gridspec':False, 'location':'top', 'shrink': .5},
            square=True,
            linewidths=.5,
            linecolor='k',
            ax=ax)

plt.xticks(rotation=45, ha='right') 
ax.set(title='Bidirectional normalized co-occurrence of cognitive tests')
plt.savefig('../outputs/cooccurances/normalized_heatmap.png')
plt.close()

Here is the same heatmap, but tests are clustered:

In [5]:
# plot cluster map for co-occurrence counts
sns.clustermap(cooccurances,
            xticklabels=cognitive_tests,
            yticklabels=cognitive_tests,
            figsize=(15,15))

plt.xticks(rotation=45, ha='right') 
plt.suptitle('Bidirectional co-occurrence of cognitive tests')
plt.savefig('../outputs/cooccurances/clustermap.png')
plt.close()

# plot cluster map for normalized co-occurrence
sns.clustermap(cooccurances_normalized,
            xticklabels=cognitive_tests,
            yticklabels=cognitive_tests,
            figsize=(15,15))

plt.xticks(rotation=45, ha='right') 
plt.suptitle('Bidirectional normalized co-occurrence of cognitive tests')
plt.savefig('../outputs/cooccurances/normalized_clustermap.png')
plt.close()



## Visualization preprocessing 

Before plotting the co-occurrence map as a graph, a few preprocessing steps significantly improve the visuals:

1. Remove low-degree nodes and low-weight edges.
2. Use degree centrality as the node size; degree centrality reflects the number of edges connected to a given node (both incoming and outgoing).
3. Find communities in the graph, i.e., highly connected subgraphs. Communities can be used to lay out the graph and color the nodes.
4. Place co-occurring tests closer to each other by using a spring layout.

I will use the de facto standard NetworkX package to perform all the graph analyses listed above. The output will be a directed graph with edges in both directions that carries the centrality of each node, community membership, edge widths, and the layout positions.
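
As a quick illustration of degree centrality on a directed graph, the sketch below computes it by hand. The edge list and node names are invented, and the formula (incoming plus outgoing edges, divided by n − 1) is my reading of how `nx.degree_centrality` handles directed graphs, so treat it as an assumption rather than a reference implementation.

```python
from collections import Counter

# Hypothetical directed edges (source, target) between made-up test names.
toy_edges = [("A", "B"), ("B", "A"), ("A", "C"), ("D", "A")]

toy_nodes = sorted({node for edge in toy_edges for node in edge})

# degree = number of incoming plus outgoing edges
toy_degree = Counter()
for source, target in toy_edges:
    toy_degree[source] += 1
    toy_degree[target] += 1

# divide by the maximum number of distinct neighbours a node could have (n - 1)
toy_centrality = {node: toy_degree[node] / (len(toy_nodes) - 1) for node in toy_nodes}

print(toy_centrality)
```

Because reciprocal edges are counted twice, a node in a directed graph can have a centrality above 1 (node `A` here has degree 4 with only 3 possible neighbours), which is worth keeping in mind when scaling node sizes.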

- [ ] **TODO:** The greedy modularity algorithm returns a single community because the nodes are densely connected. Use another algorithm to extract some information from the graph structure.

In [6]:
# normalized co-occurrences below this threshold will be ignored
MIN_WEIGHT = 0.0001

# create graph from adjacency matrix
# (nx.from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array is its replacement)
G = nx.from_numpy_array(cooccurances_normalized, create_using=nx.MultiDiGraph)

# set node labels
node_labels = {i:t for i,t in enumerate(cognitive_tests)}
nx.set_node_attributes(G, node_labels, 'label')

# remove low-weight edges first, then the nodes that are left without any edge
low_weight_edges = [(u, v) for u, v in G.edges() if G[u][v][0]['weight'] < MIN_WEIGHT]
G.remove_edges_from(low_weight_edges)
low_degree_nodes = [n for n, d in G.degree() if d == 0]
G.remove_nodes_from(low_degree_nodes)

# compute centrality
centrality = nx.degree_centrality(G)

# compute community structure
lpc = nx.community.greedy_modularity_communities(G, weight='weight')
community_index = {n: i for i, com in enumerate(lpc) for n in com}

nx.set_node_attributes(G, community_index, 'community')
nx.set_node_attributes(G, centrality, 'centrality')

# visual properties (all numeric values are relative, update them to match your canvas)
node_colors = community_index
edge_widths = [G[u][v][0]['weight'] for u, v in G.edges()]


# compute layout positions
pos = nx.spring_layout(G, k=.8)

## NetworkX plot (static)

In [7]:
fig, ax = plt.subplots(1,1,figsize=(15,15))

nx.draw_networkx_nodes(G, pos,
                       node_color=list(node_colors.values()),
                       node_size=[max(10,v*4000) for v in centrality.values()],
                       alpha=0.5, ax=ax)
nx.draw_networkx_labels(G, pos, labels=nx.get_node_attributes(G, 'label'), font_size=8, alpha=0.5, ax=ax)
# nx.draw_networkx_edges(G, pos, width=1.0, alpha=0.3, connectionstyle='arc3,rad=0.1')
nx.draw_networkx_edges(G,
                       pos,
                       alpha=0.8,
                       arrowstyle='->',
                       arrowsize=10,
                       connectionstyle='arc3,rad=0.1',
                       width=[v*20 for v in  edge_widths],
                       edge_color='gainsboro',
                       ax=ax)

# plot the graph using matplotlib backend
plt.suptitle('Co-occurrence map\n Node size shows degree centrality; edge width shows co-occurrence normalized by corpus size.')
plt.tight_layout()
plt.savefig('../outputs/cooccurances/map_networkx.png')
plt.close()

## Bokeh plot (interactive)

In [8]:

from bokeh.io import output_notebook, show; output_notebook()
from bokeh.models import (BoxSelectTool, Circle, EdgesAndLinkedNodes, HoverTool,
                          MultiLine, NodesAndLinkedEdges, Plot, Range1d, TapTool,)
from bokeh.plotting import from_networkx

plot = Plot(x_range=Range1d(-1.1,1.1), y_range=Range1d(-1.1,1.1))

graph_renderer = from_networkx(G, nx.spring_layout(G, k=1), scale=1, center=(0,0))

# data
graph_renderer.node_renderer.data_source.data['node_size'] = [max(10,v*40) for v in centrality.values()]
graph_renderer.edge_renderer.data_source.data['edge_size'] = [max(.1, v*10) for v in edge_widths]

# nodes
graph_renderer.node_renderer.glyph = Circle(size='node_size', fill_color='lightblue', line_color='white')
graph_renderer.node_renderer.selection_glyph = Circle(size='node_size', fill_color='mediumpurple', line_color='mediumpurple')
graph_renderer.node_renderer.nonselection_glyph = Circle(size='node_size', fill_color='lightblue', line_color='white')
graph_renderer.node_renderer.hover_glyph = Circle(size='node_size', fill_color='mediumpurple', line_color='mediumpurple')

# edges
graph_renderer.edge_renderer.glyph = MultiLine(line_color="#CCCCCC", line_alpha=0.8, line_width='edge_size')
graph_renderer.edge_renderer.selection_glyph = MultiLine(line_color='mediumpurple', line_width='edge_size')
graph_renderer.edge_renderer.hover_glyph = MultiLine(line_color='mediumpurple', line_width='edge_size')

# tools
node_hover_tool = HoverTool(tooltips=[('Label','@label')])
plot.add_tools(node_hover_tool, TapTool())
graph_renderer.selection_policy = NodesAndLinkedEdges()
# graph_renderer.inspection_policy = EdgesAndLinkedNodes()

plot.renderers.append(graph_renderer)

plot.title.text = "Interactive co-occurrence map of cognitive tests"
show(plot)

## PyVis plot (interactive)

In [9]:
from pyvis.network import Network

nt = Network(height='1000px', width='1000px', directed=False, notebook=True, heading='Cognitive tests co-occurrence map')

nt.from_nx(G)

for i,n in enumerate(nt.nodes):
    n['size'] = max(10, n['centrality'] * 40)
    n['color'] = n['community']

for e in nt.edges:
    e['value'] = e['weight']



pyvis_options = """
var options = {
  "physics": {
    "barnesHut": {
      "gravitationalConstant": 0,
      "centralGravity": 0,
      "springLength": 400,
      "springConstant": 0.1,
      "damping": 1,
      "avoidOverlap": 1
    }
  },
  "nodes": {
    "color": {
      "border": "white",
      "background": "lightblue",
      "highlight": {
        "border": "mediumpurple",
        "background": "mediumpurple"
      }
    }
  },
  "edges": {
    "arrows": {
      "to": {
        "enabled": true
      },
      "from": {
        "enabled": true
      }
    },
    "arrowStrikethrough": false,
    "color": {
      "color": "#dcdcdc88",
      "highlight": "mediumpurple",
      "inherit": false
    },
    "smooth": {
      "type": "continuous",
      "forceDirection": "none"
    }
  }
}
"""

nt.set_options(pyvis_options)
# nt.show_buttons(filter_=['nodes'])

nt.show('../outputs/cooccurances/map_pyvis.html')

## Plotly Sankey (interactive)

In [47]:
import plotly.graph_objects as go

# uncomment to open fig in the browser instead of embedding into the notebook
# import plotly.io as pio; pio.renderers.default = 'browser'


# prep data for sankey
cooc_df = pd.DataFrame(cooccurances).unstack().reset_index().rename(columns={'level_0': 'task_i', 'level_1': 'task_j', 0: 'cooccurance'})
index_to_test_mapping = {i:c for i,c in enumerate(cognitive_tests)}
# df.replace({'task_i': index_to_test_mapping, 'task_j': index_to_test_mapping}, inplace=True)

# only pairs that co-occurred more than 50 times
cooc_df = cooc_df.query('cooccurance > 50')


fig = go.Figure(data=[go.Sankey(
    node = dict(
      pad = 10,
      thickness = 10,
      line = dict(color = 'black', width = 0.5),
      label = cognitive_tests,
      color = 'lightblue'
    ),
    link = dict(
      source = cooc_df['task_i'],
      target = cooc_df['task_j'],
      value = cooc_df['cooccurance']
  ))])

fig.update_layout(title_text='Co-occurrence Sankey map (only pairs that co-occurred in more than 50 articles)', font_size=15)

fig.write_html('../outputs/cooccurances/sankey_map_plotly.html')

fig.show()
