## Network measures

### Local structures

**Indegree**
This is mostly a function of how Wikipedians revised the document and should largely be uniform across pages. The large values are likely pages with 'lists' of links.

**Outdegree**
This is a first-order measure of an idea's influence.
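
As a minimal sketch of the two degree measures, the toy graph below is hypothetical and is not one of the Wikipedia networks. It assumes that an edge `u -> v` means page `u` links to page `v`; the direction used by the `wiki` module may differ.

```python
import networkx as nx

# Hypothetical toy graph; edge u -> v is assumed to mean "u links to v".
g = nx.DiGraph([('dna', 'gene'), ('rna', 'gene'), ('gene', 'protein')])

indegree = dict(g.in_degree())    # incoming edges per node
outdegree = dict(g.out_degree())  # outgoing edges per node

print(indegree)
print(outdegree)
```

Both are local (first-order) quantities: they count only a node's immediate edges.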

### Mesoscale structures

**Clustering**
Clustering looks similar across topics.

**Centrality**
This reveals the distribution of sources of ideas within a field.
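
A small illustration of betweenness centrality, the centrality used in this notebook, on a hypothetical directed chain. In a chain, only the interior nodes lie on shortest paths between other nodes.

```python
import networkx as nx

# Hypothetical chain a -> b -> c -> d.
g = nx.DiGraph([('a', 'b'), ('b', 'c'), ('c', 'd')])

# Normalized by default; for directed graphs the factor is 1 / ((n-1)(n-2)).
centrality = nx.betweenness_centrality(g)

for node, value in centrality.items():
    print(node, round(value, 3))
```

Here `b` sits on the shortest paths `a -> c` and `a -> d`, so its normalized score is 2/6, while the endpoints score 0.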

**Path lengths**

**Rich-club coefficient**

**Modularity**

**Controllability**
This is an nth-order measure of influence.

**Observability**
This is an nth-order measure of the inverse of influence.

**Coreness**
It seems that the more focused a topic is on a subtopic, the stronger the coreness. For example, genetics is heavily focused on DNA, and so it has high coreness. At the same time, in the field of economics, the concept of "economics" has high degree. Yet, it has low coreness because the field itself is heterogeneous, with major subfields such as "macroeconomics" and "microeconomics".

**Characteristic path length**
I'm not sure what path length reveals. Perhaps it measures the heterogeneity of research in a field? Topologically, it describes how far one idea is from another. Cognitive science and earth science have ideas that are far apart.
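
To make "how far one idea is from another" concrete, here is a rough sketch of one common definition of characteristic path length: the mean shortest-path length over distinct, reachable node pairs. The toy graph is hypothetical, and ignoring unreachable pairs is an assumption of this sketch.

```python
import networkx as nx
import numpy as np

# Hypothetical chain a -> b -> c.
g = nx.DiGraph([('a', 'b'), ('b', 'c')])

# Shortest-path lengths between distinct nodes; unreachable pairs
# never appear in the result, so they are ignored here.
lengths = [dist
           for source, targets in nx.shortest_path_length(g)
           for target, dist in targets.items()
           if source != target]

char_path_length = float(np.mean(lengths))
print(char_path_length)
```

The three reachable pairs have lengths 1, 1 and 2, so the mean is 4/3.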

In [None]:
%reload_ext autoreload
%autoreload 2
import os,sys
sys.path.insert(1, os.path.join(sys.path[0], '..', 'module'))
import wiki
import numpy as np
import pandas as pd
import networkx as nx

## Load networks

In [None]:
topics = ['anatomy', 'biochemistry', 'cognitive science', 'evolutionary biology',
          'genetics', 'immunology', 'molecular biology', 'chemistry', 'biophysics',
          'energy', 'optics', 'earth science', 'geology', 'meteorology',
          'philosophy of language', 'philosophy of law', 'philosophy of mind',
          'philosophy of science', 'economics', 'accounting', 'education',
          'linguistics', 'law', 'psychology', 'sociology', 'electronics',
          'software engineering', 'robotics']#, 'physics', 'mathematics']

In [None]:
path_saved = '/Users/harangju/Developer/data/wiki/graphs/dated/'
networks = {}
for topic in topics:
    print(topic, end=' ')
    networks[topic] = wiki.Net()
    networks[topic].load_graph(path_saved + topic + '.pickle')

In [None]:
path_saved = '/Users/harangju/Developer/data/wiki/graphs/null-target/'
num_nulls = 2
null_targets = {}
for topic in topics:
    print(topic, end=' ')
    null_targets[topic] = []
    for i in range(num_nulls):
        network = wiki.Net()
        network.load_graph(path_saved + topic + '-null-' + str(i) + '.pickle')
        null_targets[topic].append(network)

## Run analysis

**NOTE:** Skip this section if you are loading previously saved stats (see *Load stats*).

### Basic stats

In [None]:
import bct
from networkx.algorithms.cluster import clustering
from networkx.algorithms import betweenness_centrality
from networkx.convert_matrix import to_numpy_array
pd.options.display.max_rows = 12

In [None]:
measures = {'indegree': lambda g: [x[1] for x in g.in_degree],
            'outdegree': lambda g: [x[1] for x in g.out_degree],
            'clustering': lambda g: list(clustering(g).values()),
            'centrality': lambda g: list(betweenness_centrality(g).values()),
            'path-length': lambda g: [y for x in list(nx.shortest_path_length(g))
                                      for y in list(x[1].values())],
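            # bct.charpath expects a distance matrix, not an adjacency matrix;
            # unreachable (infinite-distance) pairs are excluded from the mean
            'char-path-length': lambda g: bct.charpath(
                bct.distance_bin(to_numpy_array(g)), include_infinite=False)[0],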
            'modularity': lambda g: g.graph['modularity'],
            'coreness': lambda g: g.graph['coreness']}

In [None]:
networks['anatomy'].graph.graph

In [None]:
df = pd.DataFrame(columns=['topic','measure','value'])
for topic, network in networks.items():
    print(topic, end=' ')
    df = pd.concat([df] +
                   [pd.DataFrame([[topic, measure, func(network.graph)]],
                                 columns=['topic','measure','value'])
                    for measure, func in measures.items()],
                   ignore_index=True)

In [None]:
for topic, null_networks in null_targets.items():
    print(topic, end=' ')
    for network in null_networks:
        df = pd.concat([df] + 
                       [pd.DataFrame([[topic, measure+'-null', func(network.graph)]],
                                     columns=['topic','measure','value'])
                        for measure, func in measures.items()],
                       ignore_index=True)

In [None]:
df

In [None]:
%time df_expand = df.value\
              .apply(pd.Series)\
              .merge(df, left_index=True, right_index=True)\
              .drop('value', axis=1)\
              .melt(id_vars=['topic','measure'])\
              .drop('variable', axis=1)\
              .dropna()
df_expand
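
The cell above turns list-valued rows into one row per value. As a possible alternative, `DataFrame.explode` does the same kind of reshaping in one step. The frame below is a hypothetical stand-in for `df`, and the row order may differ from the `melt`-based version.

```python
import pandas as pd

# Hypothetical stand-in for df: one list-valued row and one scalar row.
toy = pd.DataFrame({'topic': ['a', 'a'],
                    'measure': ['indegree', 'modularity'],
                    'value': [[1, 2], 0.5]})

# Each list element becomes its own row; scalar values pass through unchanged.
long_df = toy.explode('value').reset_index(drop=True)
print(long_df)
```

On large frames this can be much faster than `apply(pd.Series)`, which builds a wide intermediate table padded with `NaN`.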

In [None]:
import pickle
# Use a context manager so the file is closed after writing
with open('/Users/harangju/Developer/data/wiki/analysis/stats.pickle', 'wb') as file:
    pickle.dump((df, df_expand), file)

### Load stats

In [None]:
import pickle
# The pickle holds the (df, df_expand) tuple written under "Run analysis"
with open('/Users/harangju/Developer/data/wiki/analysis/stats.pickle', 'rb') as file:
    df, df_expand = pickle.load(file)

### Display summary

In [None]:
df_expand.groupby(['topic','measure'])\
         .mean()\
         .reset_index()\
         .pivot(index='topic',columns='measure')

### Plot

* nice plots [seaborn](https://seaborn.pydata.org/examples/index.html)
* interactive [Bokeh](https://bokeh.pydata.org/en/latest/docs/gallery.html#gallery)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style='whitegrid', font_scale=2.4)

In [None]:
pd.options.display.max_rows = 12
plt.rcParams.update({'figure.max_open_warning': 0})

In [None]:
save_dir = None
for stat in ['indegree', 'outdegree', 'clustering', 'centrality', 'path-length']:
    f, ax = plt.subplots(figsize=(30, 6))
    sns.violinplot(data=df_expand[(df_expand.measure==stat) |\
                                  (df_expand.measure==stat+'-null')],
                   x='topic', y='value', hue='measure', split=True)
    plt.xticks(np.arange(len(topics)), topics, rotation='vertical')
    plt.ylabel(stat)
    plt.subplots_adjust(bottom=0.2)
    sns.despine(left=True, bottom=True)
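    if save_dir:
        # Save before plt.show(); saving afterwards can write a blank figure.
        # One file per measure, so later plots do not overwrite earlier ones.
        plt.savefig(os.path.join(save_dir, stat), dpi=300)
    plt.show()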

In [None]:
save_dir = None
for stat in ['coreness', 'modularity', 'char-path-length']:
    f, ax = plt.subplots(figsize=(30, 6))
    sns.scatterplot(data=df_expand[(df_expand.measure==stat) |\
                                   (df_expand.measure==stat+'-null')],
                    x='topic', y='value', hue='measure')
    plt.xticks(np.arange(len(topics)), topics, rotation='vertical')
    plt.ylabel(stat)
    plt.subplots_adjust(bottom=0.2)
    sns.despine(left=True, bottom=True)
    if save_dir:
        # Save before plt.show(), and into save_dir rather than path_saved
        plt.savefig(os.path.join(save_dir, stat), dpi=300)
    plt.show()

In [None]:
save_dir = None
f, axs = plt.subplots(ncols=2, figsize=(10,5))
f.tight_layout()
for i, stat in enumerate(['coreness', 'modularity']):
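    # Mean null value per topic (the result is sorted by topic)
    x = df_expand[df_expand.measure==stat+'-null']\
        .astype({'value': float})\
        .groupby('topic')['value'].mean()
    # Index the real values by topic so they line up with the null means
    y = df_expand[df_expand.measure==stat]\
        .astype({'value': float})\
        .set_index('topic')['value'].reindex(x.index)
    sns.scatterplot(x=x.values, y=y.values, ax=axs[i], marker='x')
    z = np.concatenate((x.values, y.values))
    # Identity line: points above it have real > null
    sns.lineplot(x=[min(z), max(z)], y=[min(z), max(z)], ax=axs[i])
    axs[i].set_title(stat)
    if save_dir:
        plt.savefig(os.path.join(save_dir, stat), dpi=300)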
axs[0].set(xlabel='null', ylabel='real', aspect='equal')
axs[1].set(xlabel='null', aspect='equal');

### Measures in growing networks

In [None]:
comm_t = pd.DataFrame()
for topic, network in networks.items():
    print(topic, end=' ')
    comm_t = pd.concat([comm_t] +
                       [pd.DataFrame([[topic,
                                       node,
                                       network.graph.nodes[node]['year'],
                                       network.graph.nodes[node]['community'],
                                       network.graph.nodes[node]['core'],
                                       1]],
                                     columns=['topic','node','year',
                                              'comm','core','count'])
                        for node in network.graph.nodes],
                       ignore_index=True)
comm_t = comm_t.merge(comm_t.groupby(['topic','comm'])['count'].sum(),
                      on=['topic','comm'],
                      suffixes=('','_topic_comm'))\
               .merge(comm_t.groupby(['topic','core'])['count'].sum(),
                      on=['topic','core'],
                      suffixes=('','_topic_core'))\
               .sort_values(by=['topic','year'])\
               .reset_index(drop=True)
comm_t['comm_count'] = comm_t.groupby(['topic','comm'])['count'].cumsum()
comm_t['core_count'] = comm_t.groupby(['topic','core'])['count'].cumsum()
comm_t['comm_frac'] = comm_t['comm_count']/comm_t['count_topic_comm']
comm_t['core_frac'] = comm_t['core_count']/comm_t['count_topic_core']
comm_t = comm_t.drop(['count','count_topic_comm','count_topic_core'], axis=1)
comm_t
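
The growth curves rest on a running count of nodes per group, ordered by year. A toy version of that step is sketched below; the rows are hypothetical, and only the `comm` grouping is shown.

```python
import pandas as pd

# Hypothetical nodes: one topic, two communities, four birth years.
toy = pd.DataFrame({'topic': ['t'] * 4,
                    'comm': [0, 1, 0, 0],
                    'year': [1900, 1950, 1920, 2000]})

# Sort by year so the cumulative sum follows the order in which nodes appear.
toy = toy.sort_values(['topic', 'year']).reset_index(drop=True)
toy['count'] = 1

grouped = toy.groupby(['topic', 'comm'])['count']
toy['comm_count'] = grouped.cumsum()                              # nodes so far
toy['comm_frac'] = toy['comm_count'] / grouped.transform('sum')   # share of final size
print(toy)
```

Each community's `comm_frac` reaches 1.0 at the year of its last node.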

### Growth in core-periphery & modules

In [None]:
for topic in networks.keys():
    plt.figure(figsize=(20,6))
    sns.lineplot(x='year', y='comm_count', hue='comm',
                 data=comm_t[comm_t.topic==topic])
    plt.title(topic)
    plt.xlim((0,2030))
    plt.figure(figsize=(20,6))
    sns.lineplot(x='year', y='comm_frac', hue='comm',
                 data=comm_t[comm_t.topic==topic])
    plt.xlim((0,2030))
    plt.show()

In [None]:
for topic in networks.keys():
    fig = plt.figure(figsize=(20,6))
    sns.lineplot(x='year', y='core_count', hue='core',
                 data=comm_t[comm_t.topic==topic])
    plt.title(topic)
    fig = plt.figure(figsize=(20,6))
    sns.lineplot(x='year', y='core_frac', hue='core',
                 data=comm_t[comm_t.topic==topic])
    plt.title(topic)
    plt.xlim((0,2030))
    plt.show()

### Birth: core vs. periphery

In [None]:
birth = pd.concat([pd.DataFrame([[comm_t.iloc[i].topic,
                                  comm_t.iloc[i].node,
                                  comm_t.iloc[i].year,
                                  [c for c in 
                                   list(networks[comm_t.iloc[i].topic]\
                                        .graph.successors(comm_t.iloc[i].node)) + 
                                   list(networks[comm_t.iloc[i].topic]\
                                        .graph.predecessors(comm_t.iloc[i].node))
                                   if networks[comm_t.iloc[i].topic].graph.nodes[c]['core']]
                                 ]],
                                columns=['topic','periphery','year','cores'])
                   for i in range(len(comm_t.index))
                   if not comm_t.iloc[i].core],
                  ignore_index=True)
birth
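
The `birth` table pairs each periphery node with the core nodes it is directly connected to, in either direction. The same lookup on a hypothetical three-node graph is sketched below; the node names, years and the helper `core_neighbors` are made up for illustration.

```python
import networkx as nx

# Hypothetical graph with the node attributes used above ('year', 'core').
g = nx.DiGraph()
g.add_node('core_a', year=1900, core=True)
g.add_node('core_b', year=1950, core=True)
g.add_node('periph', year=2000, core=False)
g.add_edges_from([('periph', 'core_a'), ('core_b', 'periph'),
                  ('core_a', 'core_b')])

def core_neighbors(graph, node):
    """Return core nodes linked to or from `node` (successors first)."""
    neighbors = list(graph.successors(node)) + list(graph.predecessors(node))
    return [n for n in neighbors if graph.nodes[n]['core']]

# Map each periphery node to its neighboring core nodes.
birth_toy = {node: core_neighbors(g, node)
             for node in g.nodes if not g.nodes[node]['core']}
print(birth_toy)
```

Wrapping the lookup in a small function also avoids repeating `comm_t.iloc[i]` for every field.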

In [None]:
birth_exp = birth.cores.apply(pd.Series)\
                 .merge(birth, left_index=True, right_index=True)\
                 .drop(['cores'], axis=1)\
                 .melt(id_vars=['topic','periphery','year'], value_name='core')\
                 .drop('variable', axis=1)\
                 .dropna()\
                 .sort_values(by=['topic','year', 'periphery'])\
                 .reset_index(drop=True)
birth_exp['core_year'] = [networks[birth_exp.iloc[i].topic].graph\
                          .nodes[birth_exp.iloc[i].core]['year']
                          for i in range(len(birth_exp.index))]
birth_exp

In [None]:
for topic in networks.keys():
    fig = plt.figure(figsize=(4,4))
    ax = sns.scatterplot(x='year', y='core_year', marker='.',
                         data=birth_exp[birth_exp.topic==topic])
    ax.set(xlabel='periphery', ylabel='core', aspect='equal')
    z = np.concatenate((birth_exp[birth_exp.topic==topic].year.values,
                        birth_exp[birth_exp.topic==topic].core_year.values))
    sns.lineplot(x=[min(z), max(z)], y=[min(z), max(z)], ax=ax)
    plt.title(topic)