# Gateway to Research - Topic Evolution

In this notebook we measure the K factor of publications from arXiv that have been tagged with _Field of Study_ (FoS) labels from Microsoft Academic Graph.

## Preamble

In [None]:
%load_ext autoreload
%autoreload 2
# install im_tutorial package
!pip install git+https://github.com/nestauk/im_tutorials.git

In [None]:
from collections import Counter, defaultdict
import itertools

# matplotlib for static plots
import matplotlib.pyplot as plt
# numpy for mathematical functions
import numpy as np
# pandas for handling tabular data
import pandas as pd

from im_tutorials.utilities import chunks
from im_tutorials.data import datasets

In [None]:
import ast
import json
from collections import Counter
from gensim.corpora.dictionary import Dictionary
from collections import defaultdict
from itertools import combinations, chain

pd.set_option('display.max_columns', 99)

In [None]:
import configparser
from sqlalchemy.engine.url import URL
from sqlalchemy.engine import create_engine

In [None]:
from gensim.corpora import Dictionary
from rhodonite.cooccurrence.basic import cooccurrence_graph
from rhodonite.cooccurrence.cumulative import cumulative_cooccurrence_graph
from rhodonite.cooccurrence.normalise import association_strength

from gensim.models import LdaModel
from annoy import AnnoyIndex
from gensim.sklearn_api.ldamodel import LdaTransformer

import graph_tool as gt
from graph_tool.draw import graph_draw
from sklearn.decomposition import TruncatedSVD

import networkx as nx
import community

from nesta.packages.nlp_utils.ngrammer import Ngrammer
from gensim.corpora import Dictionary
from gensim.models.word2vec import Word2Vec

In [None]:
from scipy.spatial.distance import cosine
from sklearn.manifold import TSNE

## Data

### Load Data

In [None]:
gtr_projects = datasets.gateway_to_research_projects()

In [None]:
df = pd.read_csv('../data/raw/gtr/gtr_projects.csv')

### Clean Data

We're going to focus on data from just one funder, the EPSRC.

After selecting EPSRC projects, we keep only those with a start year from 2005 to 2017, as there are few projects outside this range. We then remove those with abstracts that have fewer than 300 characters.

In [None]:
df = df[df['funder_name'] == 'EPSRC']
df = df[(df['start_year'] > 2004) & (df['start_year'] < 2018)]
df = df.sort_values('start_year')
df = df.reset_index()

In [None]:
fig, ax = plt.subplots()
ax.hist(df['abstract_texts'].str.len(), bins=100)
ax.set_title('Abstract Lengths')
ax.set_xlabel('Number of Characters')
ax.set_ylabel('Frequency');

After seeing the abstract lengths, we will drop any that are very short.

In [None]:
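df = df[df['abstract_texts'].str.len() >= 300]
# reset the index so that row labels match positions in the arrays built from df later on
df = df.reset_index(drop=True)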
# df = df.drop_duplicates('abstract_texts')

## Natural Language Processing

### Text Preprocessing

In [None]:
from im_tutorials.features.text_preprocessing import *
from itertools import chain

#### Tokenisation

Typically, for computers to understand human language, it needs to be broken down into components, e.g. sentences, syllables, or words.

In the case of this work, we are going to analyse text at the word level. In natural language processing, the components below the sentence level are called **tokens**. The process of breaking a piece of text into tokens is called **tokenisation**. A token could be a word, number, email address or punctuation, depending on the exact tokenisation method used.

For example, tokenising the sentence `'The dog chased the cat.'` might give `['The', 'dog', 'chased', 'the', 'cat', '.']`.

In this case we will apply some extra processing during the tokenisation phase. We will:

1. Tokenise each document at the word level.
2. Remove punctuation.
3. Remove **stop words**, such as `the`, `and`, `to` etc.
4. Apply lower case to all tokens.
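
The four steps above can be sketched in plain Python. This is only a minimal illustration: the regular expression, the tiny `STOP_WORDS` set and the function name `simple_tokenize` are assumptions made for this example, and it is not the implementation of `tokenize_document` used in the next cell.

```python
import re

# A tiny, illustrative stop word list (real lists are much longer).
STOP_WORDS = {'the', 'and', 'to', 'a', 'of'}

def simple_tokenize(text):
    """Lower-case a text, split it into word tokens, and drop punctuation and stop words."""
    # \w+ keeps runs of letters, digits and underscores, so punctuation is discarded
    tokens = re.findall(r'\w+', text.lower())
    # remove stop words after lower-casing so that 'The' and 'the' are treated alike
    return [t for t in tokens if t not in STOP_WORDS]

print(simple_tokenize('The dog chased the cat.'))
```

Lower-casing is applied first here so that the stop word check does not depend on capitalisation; the order of the other steps does not change the result for this simple case.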

In [None]:
tokenized = [list(chain(*tokenize_document(document))) for document in df['abstract_texts'].values]

In [None]:
print('Original text of first document:')
print(df['abstract_texts'].values[0], '\n')

n_tokens_print = 10
print(f'First {n_tokens_print} tokens in first document:')
print(tokenized[0][:n_tokens_print])

#### Lemmatisation

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

In [None]:
wnl = WordNetLemmatizer()

In [None]:
lemmas = [[wnl.lemmatize(t) for t in b] for b in tokenized]

#### N Grams

In some cases, words appear together more often than we would expect by chance. This might happen where we have commonly used phrases, or names of entities, for example `general relativity`. It can be useful to identify cases of this in our text so that the machine can understand that they represent different information when compared to the words appearing separately. Tokens of multiple words are called **n grams**. N grams containing two tokens are **bigrams**, n grams containing three tokens are **trigrams** and so on.

For example, in a corpus of text, we might have the sentence, `'I travelled from York to New York to find a new life.'`. After tokenisation and finding bigrams, we might end up with `['i', 'travelled', 'from', 'york', 'to', 'new_york', 'to', 'find', 'a', 'new', 'life', '.']`.
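
The idea can be illustrated with a simplified sketch that joins adjacent token pairs occurring at least `min_count` times across the corpus. The function name `merge_frequent_bigrams` and the toy documents are assumptions for this example, and the plain count threshold is a simplification: gensim's `Phrases` also applies a statistical score to each candidate pair.

```python
from collections import Counter

def merge_frequent_bigrams(docs, min_count=2):
    """Join adjacent token pairs that occur at least `min_count` times in the corpus."""
    # count every pair of neighbouring tokens across all documents
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(doc[i] + '_' + doc[i + 1])
                i += 2  # skip both tokens of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs

docs = [
    ['i', 'travelled', 'from', 'york', 'to', 'new', 'york'],
    ['new', 'york', 'is', 'a', 'new', 'start'],
]
print(merge_frequent_bigrams(docs))
```

Only `new york` appears twice in these toy documents, so it is the only pair that gets merged; the standalone `york` and `new` tokens are left untouched.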

To create bigrams, we are going to use the natural language processing module **`gensim`**.

In [None]:
from gensim.models.phrases import Phraser, Phrases

In [None]:
# only find ngrams that appear 10 times or more
phrases = Phrases(lemmas, min_count=10)
phraser = Phraser(phrases)

bigrammed = [phraser[t] for t in lemmas]


print('Number of unique bigrams identified:', len(phraser.phrasegrams))
print('Bigram examples:')
print([b[0] for b in phraser.phrasegrams.items() if np.random.random() > 0.999], '\n')
n_tokens_print = 20
print(f'First {n_tokens_print} tokens in first document after finding bigrams:')
print(bigrammed[0][:n_tokens_print])

In [None]:
n_tokens_print = 20

n_bigrams = len(phraser.phrasegrams)
print('Number of unique bigrams identified:', len(phraser.phrasegrams), '\n')
print(f'{n_tokens_print} randomly selected bigrams:')
print(['_'.join([s.decode() for s in b[0]]) 
       for b in phraser.phrasegrams.items() if np.random.random() > 1 - n_tokens_print*(1 / n_bigrams)], '\n')

print(f'First {n_tokens_print} tokens in first document after finding bigrams:')
print(bigrammed[0][:n_tokens_print])

#### High Frequency Terms

As well as stop words and punctuation, there may be other words that we want to remove, which are unique to our corpus. Often these are the tokens which appear very often and therefore convey little distinguishing information about each document.

Let's count up all of the tokens in our processed corpus and see which are the most common.

In [None]:
token_counts = Counter(chain(*bigrammed))
token_counts.most_common(30)

We are going to remove terms that occur too often in our corpus. To do this, we need to pick a threshold value. A convenient way to do this is to use a maximum document frequency. In this case, we approximate it with total token counts: if a token has appeared more times than a certain percentage of the number of documents, it will be removed.
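
For comparison, a strict document frequency counts each token at most once per document. The sketch below shows that variant on a toy corpus; the function name `filter_by_doc_frequency` and the example documents are assumptions for illustration, and the cell that follows uses total token counts instead.

```python
from collections import Counter

def filter_by_doc_frequency(docs, max_doc_frequency=0.4):
    """Remove tokens that appear in more than `max_doc_frequency` of the documents."""
    # set(doc) ensures each token is counted at most once per document
    doc_counts = Counter(token for doc in docs for token in set(doc))
    max_docs = len(docs) * max_doc_frequency
    return [[t for t in doc if doc_counts[t] <= max_docs] for doc in docs]

docs = [
    ['research', 'laser', 'research'],
    ['research', 'protein'],
    ['research', 'graphene'],
    ['laser', 'optics'],
    ['research', 'algebra'],
]
print(filter_by_doc_frequency(docs))
```

Here `research` appears in four of the five documents and is removed, while `laser` appears in two and is kept.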

In [None]:
max_doc_frequency = 0.4
abstracts_tokenized = [[t for t in d if token_counts[t] < df.shape[0] * max_doc_frequency] for d in bigrammed]

### From Human to Computer Language

Once we have preprocessed our text, we can apply various NLP techniques to further process, analyse, summarise the text, extract information from it, or use it as features in a later analysis.

#### Bag of Words

In general, when dealing with text, we need to somehow convert it into numeric data that can be processed and analysed using mathematics. A very simple example would be to count the number of times each token appears in a document. For example, if we have the sentence `'I like really cute cats, but all cats are cute really.'`, after pre-processing and tokenisation, we could generate a vector of word counts where each position represents the token count:

```
vector      token
[1,         i
 1,         like
 2,         really
 2,         cute
 2,         cats
 1,         but
 1,         all
 1,]        are
```

This method is called the **bag of words** approach, and in this case we can determine that the document is about really cute cats. But in real life, with many documents, things are not always so straightforward.
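
The counts in the example above can be reproduced with the standard library. This is a minimal sketch that assumes a simple regular expression tokeniser and no stop word removal; the notebook itself builds its bag of words with gensim in the cells that follow.

```python
import re
from collections import Counter

sentence = 'I like really cute cats, but all cats are cute really.'

# lower-case the sentence and keep only word tokens (punctuation is dropped)
tokens = re.findall(r'\w+', sentence.lower())

# a Counter maps each distinct token to the number of times it appears
bag = Counter(tokens)
print(bag)
```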

- What are some potential limitations of bag of words?

In [None]:
from gensim.corpora import Dictionary

In [None]:
dictionary = Dictionary(abstracts_tokenized)
bow = [dictionary.doc2bow(d) for d in abstracts_tokenized]

In [None]:
doc_id = 0
print(f'Bag of words token frequencies for document {doc_id}:')
print([(dictionary[b[0]], b[1]) for b in sorted(bow[doc_id], key=lambda x: x[1], reverse=True)])

#### TF-IDF

An improvement on the simple bag of words is to somehow weight each token by its importance, or how much information it carries. One way to do this is by weighting the count of each word in a document with the inverse of its frequency across _all_ documents. This is called **term frequency-inverse document frequency** or **tf-idf**.

By doing this, a reasonably common word like `'height'` would probably be weighted lower than a less common, but more specific term such as `'altitude'`. Even if we have a document where height is mentioned more frequently than altitude, tf-idf can help us to identify that the document is referring to height in the context of altitude, rather than for example the height of a person.
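
The weighting can be made concrete with a small worked sketch. It assumes one common variant of the formula, `tf * log(N / df)`, where `tf` is the token count in the document, `N` is the number of documents and `df` is the number of documents containing the token. The function name `tfidf_weights` and the toy corpus are assumptions for illustration; gensim's `TfidfModel` has its own weighting and normalisation options, so its numbers will differ.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {token: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # document frequency: number of documents in which each token appears
    doc_freq = Counter(token for doc in docs for token in set(doc))
    weighted = []
    for doc in docs:
        term_freq = Counter(doc)
        weighted.append({t: tf * math.log(n_docs / doc_freq[t])
                         for t, tf in term_freq.items()})
    return weighted

docs = [
    ['height', 'height', 'altitude'],
    ['height', 'person'],
    ['height', 'building'],
    ['altitude', 'aircraft'],
]
weights = tfidf_weights(docs)
print(weights[0])
```

In the first toy document `height` occurs twice and `altitude` once, yet `altitude` receives the larger weight because it appears in fewer documents overall.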

We will continue to use gensim for this functionality.

In [None]:
from gensim.models.tfidfmodel import TfidfModel

In [None]:
tfidf = TfidfModel(bow, id2word=dictionary)

In [None]:
doc_id = 0
print([(dictionary[b[0]], '{:0.2f}'.format(b[1])) 
       for b in sorted(tfidf[bow[doc_id]], key=lambda x: x[1], reverse=True)])

We can now see that terms that are much more specific are weighted relatively higher than those which convey higher level and more generic information.

#### Topic Modelling

In this case, we have thousands of documents, which is too many for a single person to read and understand in a reasonable space of time. A useful first step is often to be able to understand what the main themes are within the documents we have. Bag of words or tf-idf are useful processing methods, but they still require us to inspect each document individually or group them and identify topics manually. 

Luckily, there are automated methods of finding the groups of tokens that describe broad themes within a set of documents, which are referred to as **topic modelling**.

In this case, we are going to use **Latent Dirichlet Allocation** or **LDA**.

In [None]:
from gensim.models.ldamodel import LdaModel
from gensim.sklearn_api import LdaTransformer

One aspect of many topic modelling methods is that you have to specify the number of topics you expect in advance.

- What are the disadvantages of this?

In [None]:
num_topics = 300

lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=num_topics)

Let's print out an evenly spaced sample of topics and have a look at how coherent they are.

In [None]:
n_topics_print = 10

for topic_id in range(0, num_topics, int(num_topics/n_topics_print)):
    print('Topic', topic_id)
    print(lda.print_topic(topic_id), '\n')

#### Document Vectors

In [None]:
lda_transformer = LdaTransformer(num_topics=num_topics, id2word=dictionary)
lda_transformer.gensim_model = lda
lda_vecs = lda_transformer.transform(bow)

In [None]:
# def get_topic_terms(topic_id, model, num_topics=None):
#     num_topics = num_topics + 1
#     topic_terms = [model.id2word[t[0]] for t 
#                    in model.get_topic_terms(topic_id)[:num_topics]]
#     return topic_terms

def make_topic_terms(model, num_topic_terms):
    topic_terms = []
    for i in range(model.num_topics):
        topic_terms.append([model.id2word[t[0]] for t 
                   in model.get_topic_terms(i)[:num_topic_terms]])
    return np.array(topic_terms)

def make_topic_names(topic_vectors, topic_terms, num_topics=None):
    topic_names = []
    for vector in topic_vectors:
        topic_ids = np.argsort(vector)[::-1][:num_topics]
        name = ', '.join([c for c in chain(*topic_terms[topic_ids])])
        topic_names.append(name)
    return topic_names
#     name = []
#     for topic_id in topic_ids:
#         topic_terms = get_topic_terms(topic_id, model, num_topic_terms)
#         name.extend(topic_terms)
#     return name

In [None]:
topic_terms = make_topic_terms(lda, 2)

In [None]:
doc_id = 0
print(make_topic_names([lda_vecs[doc_id]], topic_terms, num_topics=3))

In [None]:
topic_names = make_topic_names(lda_vecs, topic_terms, num_topics=3)

#### Visualising Topics

In [None]:
from sklearn.decomposition import TruncatedSVD

In [None]:
n_components = 30
svd = TruncatedSVD(n_components=n_components)
svd.fit(lda_vecs)
svd_vecs = svd.transform(lda_vecs)

In [None]:
tsne = TSNE(n_components=2)
tsne_vecs = tsne.fit_transform(svd_vecs)

In [None]:
from sklearn.cluster import KMeans

n_clusters = 30
kmm = KMeans(n_clusters=n_clusters)
kms = kmm.fit_predict(svd_vecs)

In [None]:
from bokeh.plotting import figure
from bokeh.io import output_notebook, show
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.palettes import Plasma256

output_notebook()

In [None]:
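import matplotlib

# colour map for the clusters; 'plasma' is an assumed choice, as `cmap` is not defined before this cell
cmap = plt.get_cmap('plasma')
norm = matplotlib.colors.Normalize(vmin=np.min(kms), vmax=np.max(kms))
colors = [matplotlib.colors.to_hex(cmap(norm(i))) for i in kms]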
cds = ColumnDataSource(data={'tsne_0': tsne_vecs[:, 0],
                             'tsne_1': tsne_vecs[:, 1],
                             'name': topic_names,
                             'color': colors,
                             'cluster': kms})

p = figure(width=900)
hover = HoverTool(tooltips=[("Topic", "@name"), ("Cluster", "@cluster")])
p.circle(source=cds, x='tsne_0', y='tsne_1', fill_color='color', line_color='color', 
         fill_alpha=0.5, line_alpha=0.5, radius=.5)
p.add_tools(hover)

show(p)

In [None]:
from annoy import AnnoyIndex

In [None]:
annoy_indices = {}
# group by a single column name so that `year` is a scalar rather than a tuple
for year, group in df.groupby('start_year'):
    ids = group.index.values

    vecs = svd_vecs[ids]
    t = AnnoyIndex(svd.n_components, 'angular')  # Length of item vector that will be indexed
    for idx, vec in zip(ids, vecs):
        t.add_item(idx, vec)
    t.build(500)
    annoy_indices[year] = t

In [None]:
years = df['start_year'].unique()

In [None]:
# despite its name, this is the maximum Annoy angular distance at which two projects are linked
min_dist = 0.8

project_edges = defaultdict(list)

for year, group in df.groupby('start_year'):
    edges_year = []
    ids = group.index.values
    annoy_index = annoy_indices[year]
    for idx in ids:
        for neighbour_idx in annoy_index.get_nns_by_item(idx, 30):
            if neighbour_idx == idx:
                continue
            else:
                dist = annoy_index.get_distance(neighbour_idx, idx)
                if dist < min_dist:
                    edges_year.append((idx, neighbour_idx, {'dist': 1 - dist}))
    project_edges[year].extend(edges_year)

In [None]:
import networkx as nx

In [None]:
g_p = nx.Graph()
g_p.add_edges_from(project_edges[2016])

In [None]:
g_p_node_pos = nx.spring_layout(g_p, seed=101, weight='dist')
nx.draw(g_p, pos=g_p_node_pos, node_size=15, node_color='C0')

In [None]:
import community
import matplotlib

In [None]:
communities = community.best_partition(g_p, resolution=0.3, weight='dist')

In [None]:
nx.draw(g_p, pos=g_p_node_pos, node_size=15, node_color=list(communities.values()), cmap=matplotlib.cm.hsv)

In [None]:
resolution = 0.3

project_graphs = {}
project_communities = {}
community_labels = {}
for year, edge_list in project_edges.items():
    g = nx.Graph()
    g.add_edges_from(edge_list)
    project_graphs[year] = g
    
    communities = community.best_partition(g, resolution=resolution, weight='dist')
    print(f'N Communities at {year}:', len(set(communities.values())))
    
    community_ids = defaultdict(list)
    for proj, c in communities.items():
        community_ids[c].append(proj)
    project_communities[year] = community_ids

In [None]:
lda_communities = {}

for year, communities_year in project_communities.items():
    lda_communities_year = []
    for community_id, docs in communities_year.items():
        mean_vec = np.mean(lda_vecs[docs], axis=0)
        mean_vec = mean_vec / np.max(mean_vec)
        lda_communities_year.append(mean_vec)
    lda_communities[year] = lda_communities_year

In [None]:
svd_communities = {}

for year, communities_year in project_communities.items():
    svd_communities_year = []
    for community_id, docs in communities_year.items():
        mean_vec = np.mean(svd_vecs[docs], axis=0)
        mean_vec = mean_vec / np.max(mean_vec)
        svd_communities_year.append(mean_vec)
    svd_communities[year] = svd_communities_year

In [None]:
from scipy.spatial.distance import cosine

In [None]:
n_neighbours = 3

communities_edges = {}

for year, vecs in lda_communities.items():
    edges = []
    for i, vec in enumerate(vecs):
        similarities = [1 - cosine(vec, v) for v in vecs]
        neighbours = np.argsort(similarities)[::-1][1:n_neighbours+1]
        for n in neighbours:
            edge = (i, n)
            edges.append(edge)
    communities_edges[year] = edges

In [None]:
h = nx.Graph()
h.add_edges_from(communities_edges[2017])

In [None]:
communities = community.best_partition(h)

In [None]:
nx.draw(h, node_color=list(communities.values()))

In [None]:
similarity_thresh = 0.8

agg_edges = []
max_parents = 1

for i, year in enumerate(years):
    if i > 0:
        past_year = year - 1
        past_vecs = svd_communities[past_year]
        current_vecs = svd_communities[year]
        for idx, vec in enumerate(current_vecs):
            similarities = [1 - cosine(vec, c_past) for c_past in past_vecs]
            sim_max_ids = np.argsort(similarities)[::-1][:max_parents]
            for sim_max_idx in sim_max_ids:
                edge = (f'{year}_{idx}', f'{past_year}_{sim_max_idx}', {'weight': similarities[sim_max_idx]})
                # append inside the loop so that every parent is kept when max_parents > 1
                agg_edges.append(edge)

In [None]:
nodes = []
for year, communities in project_communities.items():
    for idx, _ in enumerate(communities):
        nodes.append(f'{year}_{idx}')

In [None]:
plt.hist([e[2]['weight'] for e in agg_edges], bins=50);

In [None]:
h = nx.DiGraph()
h.add_nodes_from(nodes)
h.add_edges_from(agg_edges)
# h.add_edges_from([e for e in agg_edges if e[2]['weight'] > 0.5])

In [None]:
from sklearn.manifold import TSNE

In [None]:
pos_x = np.array([int(d.split('_')[0]) for d in h.nodes])
pos_x = pos_x - np.max(pos_x)

tsne_agg = TSNE(n_components=1)
svd_df = pd.DataFrame(np.array(list(chain(*svd_communities.values()))))
pos_y = tsne_agg.fit_transform(svd_df)

pos_y = pos_y - np.min(pos_y) 
pos_y = pos_y / np.max(pos_y)

pos = {}
for node, x, y in zip(h.nodes, pos_x, pos_y):
    pos[node] = (x, y[0])

In [None]:
tsne_xy = TSNE(n_components=2)
pos_xy = tsne_xy.fit_transform(list(chain(*svd_communities.values())))

In [None]:
n_clusters = int(np.round(np.mean([len(c) for c in svd_communities.values()])))

from sklearn.cluster import KMeans

km = KMeans(n_clusters=n_clusters)
km.fit(list(chain(*svd_communities.values())))
colors = km.labels_
cmap_nodes = matplotlib.cm.hsv

In [None]:
plt.scatter(pos_xy[:, 0], pos_xy[:, 1], c=colors, cmap=cmap_nodes)

In [None]:
weights = np.array([1 / h.get_edge_data(e[0], e[1])['weight'] for e in h.edges])
weights = weights / np.max(weights)

In [None]:
import matplotlib

cmap = matplotlib.cm.get_cmap('inferno')

In [None]:
fig, ax = plt.subplots(figsize=(15, 7.5))
nx.draw(h, pos=pos, node_size=50, edge_color=weights, edge_cmap=cmap, width=2, node_color=colors, cmap=cmap_nodes)

### Topic Names

In [None]:
def get_topic_terms(topic_id, model, num_topic_terms=None):
    # look up the highest-weighted terms of a topic in the model that was passed in
    topic_terms = [model.id2word[t[0]] for t in model.get_topic_terms(topic_id)]
    if num_topic_terms is not None:
        topic_terms = topic_terms[:num_topic_terms]
    return topic_terms
    
def make_topic_name(topic_vector, model, num_topics=None, num_topic_terms=None):
    topic_ids = np.argsort(topic_vector)[::-1][:num_topics]
    name = []
    for topic_id in topic_ids:
        topic_terms = get_topic_terms(topic_id, model, num_topic_terms)
        name.extend(topic_terms)
    return name

In [None]:
for i, k in enumerate(lda_communities[2006]):
    print('Community', i)
    print(', '.join(make_topic_name(k, lda, num_topics=5, num_topic_terms=2)), '\n')

In [None]:
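# name each community node from its mean LDA topic vector, in the same year order as the graph nodes
node_topics = [', '.join(make_topic_name(t, lda, num_topics=5, num_topic_terms=1))
               for t in chain(*lda_communities.values())]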

In [None]:
node_topic_attrs = {node_id: topic for node_id, topic in zip(h.nodes, node_topics)}

In [None]:
node_topic_attrs['2011_12']

In [None]:
nx.set_node_attributes(h, node_topic_attrs, name='topic_name')

In [None]:
from bokeh.models.graphs import from_networkx
from bokeh.models import HoverTool

In [None]:
plot = figure(title="Networkx Integration Demonstration", 
              x_range=(np.min(pos_x) - .5, np.max(pos_x) + .5), 
              y_range=(np.min(pos_y) - .1, np.max(pos_y) + .1),
              width=900
             )
node_hover_tool = HoverTool(tooltips=[("Topic", "@topic_name")])
# Put everything on the plot.
plot.add_tools(node_hover_tool)
graph = from_networkx(h, pos, scale=2, center=(0,0))
plot.renderers.append(graph)

show(plot)

## Topic Co-Occurrence Network

In [None]:
plt.hist(np.max(lda_vecs, axis=1), bins=50);

In [None]:
np.percentile(np.mean(lda_vecs, axis=1), 90)

In [None]:
from rhodonite.cooccurrence.basic import cooccurrence_graph
from rhodonite.cooccurrence.normalise import association_strength
from graph_tool import GraphView
from graph_tool.topology import label_largest_component
from graph_tool.centrality import betweenness

In [None]:
topic_co_graphs = []

# iterate over the per-year community assignments built earlier
for communities in project_communities.values():
    main_topics = []
    for c, ids in communities.items():
        for vec in lda_vecs[ids]:
            main_topics.append(np.nonzero(vec > 0)[0])
    g, o_vprop, co_eprop = cooccurrence_graph(main_topics)
    a_s = association_strength(g, o_vprop, co_eprop)
    g.ep['a'] = a_s
    vb, eb = betweenness(g)
    g.vp['betweenness'] = vb
    thresh = np.percentile(a_s.a, 90)
    gv = GraphView(g, efilt=a_s.a > thresh)
    topic_co_graphs.append(gv)

In [None]:
co_topics = []
for vec in lda_vecs:
    co_topics.append(np.nonzero(vec > 0)[0])
g, o_vprop, co_eprop = cooccurrence_graph(co_topics)
a_s = association_strength(g, o_vprop, co_eprop)
g.ep['a'] = a_s
thresh = np.percentile(a_s.a, 90)
gv = GraphView(g, efilt=a_s.a > thresh)

In [None]:
from graph_tool.draw import fruchterman_reingold_layout

In [None]:
pos = fruchterman_reingold_layout(gv)

In [None]:
for gv in topic_co_graphs:
    pos_this = gv.new_vertex_property('vector<double>')
    pos_this.set_2d_array(pos.get_2d_array((0, 1)))
    graph_draw(gv, vertex_fill_color=gv.vp['betweenness'], vcmap=matplotlib.cm.viridis, output_size=(200, 200),
              pos=pos_this)

In [None]:
a = np.empty((len(years), lda.num_topics,))
# a[:] = np.nan

for i, (year, graph) in enumerate(zip(years, topic_co_graphs)):
#     max_bs = np.argsort(graph.vp['betweenness'].a)[::-1][:10]
    a[i, :] = graph.vp['betweenness'].a

In [None]:
max_bs = np.argsort(np.max(a, axis=0))[::-1][:10]

b = np.zeros((len(years), lda.num_topics,))
# b[:] = np.nan

for i, (year, graph) in enumerate(zip(years, topic_co_graphs)):
    for m in max_bs:
        b[i, m] = graph.vp['betweenness'][m]
    b[i, :] = b[i, :] / np.max(b[i, :])

In [None]:
df_betweenness = pd.DataFrame(b)
df_betweenness = df_betweenness[max_bs]

In [None]:
df_betweenness.head(1)

In [None]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure, ColumnDataSource
from bokeh.palettes import Category10_10

output_notebook()

In [None]:
df_betweenness.columns = [str(i) for i in df_betweenness.columns]

In [None]:
cds = ColumnDataSource.from_df(df_betweenness.rolling(3).mean())
p = figure(width=450, height=300)
for c, color in zip(df_betweenness.columns, Category10_10):
    p.line(x='index', y=c, source=cds, color=color)
    p.circle(x='index', y=c, source=cds, color=color)
show(p)

In [None]:
thresh = np.percentile(a_s.a, 50)
gv = GraphView(g, efilt=a_s.a > thresh)

In [None]:
# d, c = eigenvector(gv)
# c = closeness(gv)
vb, eb = betweenness(gv)

In [None]:
plt.hist(vb.a, bins=30);

In [None]:
js = np.argsort(vb.a)[::-1][:10]

In [None]:
for b in max_bs:
    print(lda.print_topic(b), '\n==========')

In [None]:
plt.hist(a_s.a, bins=100);

In [None]:
gv = GraphView(g, efilt=a_s.a > np.percentile(a_s.a, 95))

In [None]:
l = label_largest_component(gv)
graph_draw(GraphView(gv, vfilt=l))

In [None]:
from graph_tool.centrality import eigenvector, closeness, betweenness

In [None]:
graph_draw(topic_co_graphs[0], vertex_fill_color=vb, vcmap=matplotlib.cm.viridis)

In [None]:
eigenvector??

In [None]:
ev, ev_vprop = eigenvector(g, weight=a_s)

In [None]:
c = closeness(g, weight=a_s)

In [None]:
plt.hist(c.a)

In [None]:
graph_draw(GraphView(gv, vfilt=l), vertex_fill_color=ev_vprop, vcmap=matplotlib.cm.gist_heat)

**Ideas**

- Show changing topic centrality in topic co-occurrence network
- Average distance of each node to all nodes in previous year (plot as scatter)
- Distance of each node to each other node in previous years (find which ones really are a combo and which ones are mono-topical)
- Which topics have been most stable over time? Which are not?
- Calculate average centrality of communities over time

In [None]:
from sklearn.manifold import TSNE

In [None]:
tsne = TSNE(n_components=2)
tsne_vecs = tsne.fit_transform(svd_vecs[list(g.nodes)])

In [None]:
import matplotlib.pyplot as plt

In [None]:
ax = nx.draw(g, node_color=list(communities.values()), cmap='hsv', node_size=15, width=0.1)

In [None]:
df_old.groupby('start_year')['project_id'].count()

In [None]:
graphs = []

for i, (year, group) in enumerate(df_old.groupby(['start_year'])):
    ids = group.index.values
    vertex_ids_map = {idx: vertex for vertex, idx in enumerate(ids)}
    # one vertex per project in this year
    n_vertices = len(vertex_ids_map)
    g = gt.Graph(directed=False)
    g.add_vertex(n_vertices)
    edge_list = edges[i]
    edges_updated = []
    for edge in edge_list:
        edges_updated.append((vertex_ids_map[edge[0]], vertex_ids_map[edge[1]], edge[2]))
    g.vp['doc_s'] = g.new_vertex_property('int')
    g.vp['doc_t'] = g.new_vertex_property('int')
    g.ep['dist'] = g.new_edge_property('float')
    g.add_edge_list(edges_updated, eprops=[g.ep['dist']])
    gt.stats.remove_parallel_edges(g)
    graphs.append(g)

In [None]:
graphs

In [None]:
from rhodonite.utils.tabular import edges_to_dataframe

In [None]:
edge_df = edges_to_dataframe(graphs[-1])

In [None]:
graph_draw(graphs[-4])

### Data at a Glance

In [None]:
df.head(2)

In [None]:
df_mag_fos.head(2)

In [None]:
df_fos.head(2)

In [None]:
df_cat.head(2)

# All ArXiv

In [None]:
df['year'] = pd.to_datetime(df['created']).dt.year

### Level 2 FoS

In [None]:
df_mag_fos_2 = df_mag_fos[df_mag_fos['level'] == 2]
df_fos_2 = df_fos.merge(df_mag_fos_2, left_on='fos_id', right_on='id', how='inner')
article_fos_2 = df_fos_2.groupby('article_id')['name'].apply(list)
article_fos_2 = article_fos_2.reset_index()
article_fos_2['n_fos'] = [len(x) for x in article_fos_2['name']]
article_fos_2 = article_fos_2[article_fos_2['n_fos'] > 1]

In [None]:
df_2 = df.merge(article_fos_2, left_on='id', right_on='article_id', how='inner')
df_2 = df_2.sort_values('year')
df_2 = df_2[df_2['year'] >= 1992]

In [None]:
dictionary = Dictionary(df_2['name'])
df_2['fos_ids'] = [dictionary.doc2idx(d) for d in df_2['name']]
df_2_years = df_2.groupby('year')['fos_ids'].apply(list)

In [None]:
g, o_props, o_cumsum_props, co_props, co_cumsum_props = cumulative_cooccurrence_graph(df_2_years.index.values, df_2_years.values)

In [None]:
from rhodonite.utils.graph import subgraph_eprop_values

In [None]:
subgraph_eprop_values??

In [None]:
term = 'Climate change'
node = dictionary.token2id[term]
vertices = [int(i) for i in g.vertex(node).out_neighbours()]
# vertices.append(node)

In [None]:
def get_subgraph(g, vertices):
    idx = g.new_vertex_property('bool')
    idx.a[vertices] = True
    gv = GraphView(g, vfilt=idx)
    return gv

In [None]:
gv = get_subgraph(g, vertices)
gv = GraphView(gv, efilt=co_props[2000].a > 0)
l = label_largest_component(gv)
gv = GraphView(gv, vfilt=l)

In [None]:
from graph_tool.inference.minimize import minimize_nested_blockmodel_dl

In [None]:
state = minimize_nested_blockmodel_dl(gv)

In [None]:
state.draw()

In [None]:
graph_draw(gv)

### Level 0 FoS

In [None]:
df_mag_fos_0 = df_mag_fos[df_mag_fos['level'] == 0]
df_fos_0 = df_fos.merge(df_mag_fos_0, left_on='fos_id', right_on='id', how='inner')
article_fos_0 = df_fos_0.groupby('article_id')['name'].apply(list)
article_fos_0 = article_fos_0.reset_index()
article_fos_0['n_fos'] = [len(x) for x in article_fos_0['name']]
article_fos_0 = article_fos_0[article_fos_0['n_fos'] > 1]

In [None]:
df_0 = df.merge(article_fos_0, left_on='id', right_on='article_id', how='inner')
df_0 = df_0.sort_values('year')
df_0 = df_0[df_0['year'] >= 1992]

In [None]:
dictionary_0 = Dictionary(df_0['name'])
df_0['fos_ids'] = [dictionary_0.doc2idx(d) for d in df_0['name']]
df_0_years = df_0.groupby('year')['fos_ids'].apply(list)

In [None]:
df_0_years = df_0_years[df_0_years.index > 1998]

In [None]:
g0, o_props_0, o_cumsum_props_0, co_props_0, co_cumsum_props_0 = cumulative_cooccurrence_graph(df_0_years.index.values, df_0_years.values)

In [None]:
g, o_vprop, co_eprop = cooccurrence_graph(fos_ids)

In [None]:
a_s = association_strength(g, o_vprop, co_eprop)

In [None]:
from graph_tool.draw import graph_draw
from graph_tool.topology import label_largest_component
from graph_tool import GraphView

In [None]:
# keep only strongly associated edges, then restrict to the largest connected component
gv = GraphView(g, efilt=a_s.a > 10)
l = label_largest_component(gv)
gv = GraphView(gv, vfilt=l)


In [None]:
gv

In [None]:
graph_draw(gv)

In [None]:
df['year_created'] = pd.to_datetime(df['created']).dt.year

In [None]:
df = df[(df['year_created'] > 1991) & (df['year_created'] < 2019)]

In [None]:
article_fos = df_fos.groupby('article_id')['fos_id'].apply(list)

In [None]:
df.head()

### Selecting HEP Articles and FoS

In [None]:
# get high energy physics category ids
hep_cats = [a for a in df_cat['category_id'].value_counts().index if 'hep' in a]
# get unique article ids for papers with hep categories
hep_article_ids = df_cat.set_index('category_id').loc[hep_cats]['article_id'].unique()
# get fields of study for hep papers
hep_fos = df_fos.set_index('article_id').loc[hep_article_ids]
hep_fos.reset_index(inplace=True)

In [None]:
fos_id_2_level_map = {i: l for i, l in zip(df_mag_fos['id'].values, df_mag_fos['level'].values)}

In [None]:
Counter(fos_id_2_level_map.values())

In [None]:
article_fos_seqs = []
for article_id, group in hep_fos.groupby('article_id'):
    fos_ids = [f for f in group['fos_id'].values if not np.isnan(f)]
    fos_ids = [f for f in fos_ids if  fos_id_2_level_map[f] > 2]
    if len(fos_ids) > 2:
        article_fos_seqs.append((article_id, fos_ids))

In [None]:
def filter_fos(fos_list, level=2):
    filtered = []
    for fos in fos_list:
        if np.isnan(fos):
            continue
        elif fos_id_2_level_map[fos] > level:
            continue
        else:
            filtered.append(fos)
    return len(filtered)

In [None]:
df_hep_fos = pd.DataFrame(article_fos_seqs, columns=['article_id', 'fos_ids'])

In [None]:
df_hep = df.merge(df_hep_fos, left_on='id', right_on='article_id', how='inner')

In [None]:
# filter to years with enough data
df_hep['created'] = pd.to_datetime(df_hep['created'])
df_hep['year_created'] = df_hep['created'].dt.year
df_hep = df_hep[(df_hep['year_created'] > 1991) & (df_hep['year_created'] < 2019)]
df_hep = df_hep.sort_values('year_created')

In [None]:
# get fos names for dictionary creation
fos_id_2_name_mapping = {fos_id: name for fos_id, name in zip(df_mag_fos['id'].values, df_mag_fos['name'].values)}
df_hep['fos_names'] = df_hep['fos_ids'].apply(lambda x: [fos_id_2_name_mapping[i] for i in x])
# count number of fos for each paper
df_hep['n_fos'] = [len(f) for f in df_hep['fos_ids']]

In [None]:
hep_n_fos_describe_df = df_hep.groupby('year_created')['n_fos'].describe()

In [None]:
hep_n_fos_describe_df.head(2)

In [None]:
df_hep.head(2)

### Create Graphs

In [None]:
from rhodonite.cooccurrence.cumulative import *

In [None]:
dictionary = Dictionary(df_hep['fos_names'])
df_hep['fos_d_ids'] = [dictionary.doc2idx(d) for d in df_hep['fos_names']]
# c = Counter(chain(*df_hep['fos_names']))

In [None]:
communities = []
for year, group in df_hep.groupby('year_created'):
    communities.append(group['fos_d_ids'].values)

In [None]:
steps = df_hep['year_created'].unique()

In [None]:
g, o_props, o_cumsum_props, co_props, co_cumsum_props = cumulative_cooccurrence_graph(
    steps=steps, sequences=communities)

#### I/O Graphs

In [None]:
for year, co in co_graphs.items():
    co.save('../data/processed/arxiv_co_graphs/arxiv_co_{}.gt'.format(year))

In [None]:
import os

from graph_tool import Graph

In [None]:
graph_dir = '../data/processed/arxiv_co_graphs'
co_graphs = {}
for file in sorted(os.listdir(graph_dir)):
    year = file.split('.')[0][-4:]
    g = Graph(directed=False)
    g.load(os.path.join(graph_dir, file))
    co_graphs[int(year)] = g

In [None]:
def edges_2_dataframe(g):
    edge_df = pd.DataFrame(list(g.edges()), columns=['s', 't'], dtype='int')
    for k, ep in g.ep.items():
        vt = ep.value_type()
        if 'vector' not in vt:
            if ('int' in vt) | ('bool' in vt):
                edge_df[k] = ep.get_array()
                edge_df[k] = edge_df[k].astype(int)
            elif 'double' in vt:
                edge_df[k] = ep.get_array()
                edge_df[k] = edge_df[k].astype(float)
    return edge_df

def vertices_2_dataframe(g):
    vertex_df = pd.DataFrame(list(g.vertices()), columns=['v'], dtype='int')
    for k, vp in g.vp.items():
        vt = vp.value_type()
        if 'vector' not in vt:
            if ('int' in vt) | ('bool' in vt):
                vertex_df[k] = vp.get_array()
                vertex_df[k] = vertex_df[k].astype(int)
            elif 'double' in vt:
                vertex_df[k] = vp.get_array()
                vertex_df[k] = vertex_df[k].astype(float)
    return vertex_df

## Analysis

### Analysis Functions

In [None]:
def prop_dict_agg(prop_dict, aggfunc):
    agg = []
    for k, v in prop_dict.items():
        agg.append(aggfunc(v.a))
    return np.array(agg)

In [None]:
from graph_tool import GraphView

In [None]:
def is_numeric_prop(prop):
    '''True for scalar bool, int and double property maps.'''
    vt = prop.value_type()
    if 'vector' in vt:
        return False
    return any(t in vt for t in ('bool', 'double', 'int'))

def aggregate_edge_props(graph, aggfunc=np.mean):
    results = {}
    for prop_name, prop in graph.edge_properties.items():
        if is_numeric_prop(prop):
            results[prop_name] = aggfunc(prop.get_array())
    results['num_edges'] = graph.num_edges()
    return results

def aggregate_vertex_props(graph, aggfunc=np.mean):
    results = {}
    for prop_name, prop in graph.vertex_properties.items():
        if is_numeric_prop(prop):
            results[prop_name] = aggfunc(prop.get_array())
    results['num_nodes'] = graph.num_vertices()
    return results

def agg_props_to_df(graph_dict, prop_type='e', aggfunc=np.mean,
                    label_name='year', edge_filter=None, vertex_filter=None):
    records = []
    for label, graph in graph_dict.items():
        
        if (edge_filter is not None) | (vertex_filter is not None):
            if edge_filter is not None:
                efilt = graph.ep[edge_filter]
            else:
                efilt = None
            if vertex_filter is not None:
                vfilt = graph.vp[vertex_filter]
            else:
                vfilt=None
            gv = GraphView(graph, efilt=efilt, vfilt=vfilt)
            gv = Graph(gv, prune=True)
            if prop_type == 'e':
                graph_agg_props = aggregate_edge_props(gv, aggfunc=aggfunc)
            elif prop_type == 'v':
                graph_agg_props = aggregate_vertex_props(gv, aggfunc=aggfunc)
        else:
            if prop_type == 'e':
                graph_agg_props = aggregate_edge_props(graph, aggfunc=aggfunc)
            elif prop_type == 'v':
                graph_agg_props = aggregate_vertex_props(graph, aggfunc=aggfunc)
        graph_agg_props[label_name] = label
        records.append(graph_agg_props)
        if (edge_filter is not None) | (vertex_filter is not None):
            del gv
            
    return pd.DataFrame(records)
        

def edges_2_dataframe(graph):
    '''edges_2_dataframe
    Returns a dataframe with source and target columns.
    '''
    edge_df = pd.DataFrame(graph.get_edges(), columns=['s', 't', 'edge_index'])
    return edge_df

### Plotting

In [None]:
eprops_mean_df = agg_props_to_df(co_graphs)
vprops_mean_df = agg_props_to_df(co_graphs, prop_type='v')

In [None]:
eprops_sum_df = agg_props_to_df(co_graphs, aggfunc=np.sum)
vprops_sum_df = agg_props_to_df(co_graphs, aggfunc=np.sum, prop_type='v')

#### Nodes and Edges

In [None]:
n_vertices = prop_dict_agg(o_props, np.sum)

In [None]:
steps

In [None]:
plt.plot(steps, n_vertices)

In [None]:
fig, ax = plt.subplots(ncols=3, figsize=(15, 3.5))
ax[0].plot(vprops_mean_df.set_index('year')['num_nodes'])
ax[0].scatter(vprops_mean_df['year'].values, vprops_mean_df['num_nodes'].values)
ax[0].set_xlabel('Year')
ax[0].set_ylabel('N Nodes')
ax[1].plot(eprops_mean_df.set_index('year')['num_edges'])
ax[1].scatter(eprops_mean_df['year'].values, eprops_mean_df['num_edges'].values)
ax[1].set_xlabel('Year')
ax[1].set_ylabel('N Edges')
ax[2].scatter(vprops_mean_df['num_nodes'], eprops_mean_df['num_edges'])
ax[2].set_xlabel('N Nodes')
ax[2].set_ylabel('N Edges')
plt.tight_layout()
plt.savefig('../reports/hep_n_nodes_edges.png', dpi=300)
plt.show()

We have a 26 year timespan of publications. Each year, the cumulative number of topics in our publications grows; this is the total number of nodes in our knowledge graph. There are two growth regimes in our data. Between 1992 and 2005 the number of topics rises quickly, at a rate that gradually slows. After 2005 growth is linear, meaning that a constant number of topics is added each year even as the 'circle of knowledge' increases in size.

The number of edges between our topics grows linearly across all years. This only counts new edges formed between a pair of nodes, not how many existing edges were reinforced.

In the earlier half of our time period, the number of edges grew more slowly than the number of nodes. After 2005 edge growth speeds up, but remains linear.
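One way to make the two-regime description concrete is to fit a straight line to each period and compare the slopes. The sketch below is illustrative only: `ols_slope` is a hypothetical helper, the node counts are made-up numbers rather than values from our graph, and the 2005 break point is simply the one read off the plot above.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# toy cumulative node counts (NOT real data): fast early growth, slower later growth
years = list(range(1992, 2019))
toy_nodes = [1000 * (y - 1991) if y <= 2005 else 14000 + 300 * (y - 2005)
             for y in years]

break_year = 2005  # assumed break point between the two regimes
early = [(y, n) for y, n in zip(years, toy_nodes) if y <= break_year]
late = [(y, n) for y, n in zip(years, toy_nodes) if y >= break_year]

early_slope = ols_slope(*zip(*early))
late_slope = ols_slope(*zip(*late))
print(f'early: {early_slope:.0f} nodes/year, late: {late_slope:.0f} nodes/year')
```

With the real data, the same helper could be applied to the `year` and `num_nodes` columns of `vprops_mean_df`, and the break year varied to see how sensitive the two slopes are to that choice.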

#### Number of Publications and Topics

In [None]:
import seaborn as sns

sns.set_context('notebook', font_scale=1.1)

In [None]:
fig, axs = plt.subplots(ncols=3, figsize=(15, 3.5))
axs[0].plot(hep_n_fos_describe_df.index, hep_n_fos_describe_df['count'])
axs[0].scatter(hep_n_fos_describe_df.index, hep_n_fos_describe_df['count'])
axs[0].set_xlabel('Year')
axs[0].set_ylabel('N Publications')

axs[1].scatter(hep_n_fos_describe_df['count'], vprops_mean_df['num_nodes'])
axs[1].set_xlabel('N Publications')
axs[1].set_ylabel('N Fields of Study')

axs[2].plot(hep_n_fos_describe_df.index, hep_n_fos_describe_df['mean'].values)
axs[2].scatter(hep_n_fos_describe_df.index, hep_n_fos_describe_df['mean'], label='Mean')
axs[2].plot(hep_n_fos_describe_df.index, hep_n_fos_describe_df['25%'].values, color='C3')
axs[2].scatter(hep_n_fos_describe_df.index, hep_n_fos_describe_df['25%'], color='C3', label='25th percentile')
axs[2].plot(hep_n_fos_describe_df.index, hep_n_fos_describe_df['75%'].values, color='C2')
axs[2].scatter(hep_n_fos_describe_df.index, hep_n_fos_describe_df['75%'], color='C2', label='75th percentile')
axs[2].set_xlabel('Year')
axs[2].set_ylabel('Fields of Study')
axs[2].legend()

plt.tight_layout()
plt.savefig('../reports/hep_n_papers.png', dpi=300)
plt.show()

The number of high energy physics papers published on ArXiv each year grew fastest at the start of the time span, and growth slows towards a linear regime from around 2005. This mirrors the cumulative node count in our knowledge graph, as seen previously. The cause of the sharp decline in 2018 is unknown.

Indeed, the number of fields of study is proportional to the number of publications. In other words, the growth of topics discovered by our knowledge graph is proportional to the level of research effort in the system.
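A quick way to test a proportionality claim like this is to fit a line forced through the origin and check how far the points deviate from it. The sketch below uses made-up numbers, and `fit_through_origin` is a hypothetical helper, not part of the notebook's pipeline.

```python
def fit_through_origin(xs, ys):
    """Slope k minimising sum((y - k * x) ** 2), i.e. a line through (0, 0)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# toy values (NOT real data): publication counts and field of study counts
n_publications = [2000, 4000, 6000, 8000]
n_fields = [1010, 1990, 3020, 3980]

k = fit_through_origin(n_publications, n_fields)
max_rel_error = max(abs(y - k * x) / y for x, y in zip(n_publications, n_fields))
print(f'k = {k:.4f}, worst relative error = {max_rel_error:.1%}')
```

A small worst-case relative error supports proportionality; a systematic drift in the sign of the residuals would instead suggest a non-zero intercept or a curved relationship.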

The average number of fields of study per publication shows a very slight decline over time, from around 4.75 in 1992 to 4.5 in 2017. A first assumption might be that publications have become narrower in focus. Another interpretation is that topics studied in more recent papers are not yet captured by a term in the taxonomy, whereas older publications cover more established concepts that are more likely to be included. As the methods used to create field of study labels in Microsoft Academic Graph are unknown, we cannot currently distinguish between these explanations.

#### Topic Containment

In [None]:
def containment(a, b):
    con = len(a.intersection(b))/ len(a)
    return con

cons = []
i = 0
for year, group in df_hep.groupby('year_created'):
    c = Counter(chain(*group['fos_d_ids']))
    s = set(c.keys())
    if i > 0:
        con = containment(s_old, s)
        cons.append(con)
    else:
        cons.append(np.nan)
    s_old = s.copy()
    i += 1
    
hep_n_fos_describe_df['fos_containment'] = cons

In [None]:
cons_cumsum = [np.nan]
# sorted so that the values line up with the year index of hep_n_fos_describe_df
years = sorted(df_hep['year_created'].unique())
for year in years[1:]:
    s_old = set(chain(*df_hep[df_hep['year_created'] < year]['fos_d_ids']))
    s_new = set(chain(*df_hep[df_hep['year_created'] == year]['fos_d_ids']))
    cons_cumsum.append(containment(s_old, s_new))
    
hep_n_fos_describe_df['fos_cum_containment'] = cons_cumsum

In [None]:
fig, ax = plt.subplots(ncols=2, figsize=(10, 3.5))
ax[0].plot(hep_n_fos_describe_df.index, hep_n_fos_describe_df['fos_containment'])
ax[0].scatter(hep_n_fos_describe_df.index, hep_n_fos_describe_df['fos_containment'])
ax[0].set_xlabel('Year')
ax[0].set_ylabel('Containment of Topics at $T_{-1}$')
ax[0].set_ylim((0,1))

ax[1].plot(hep_n_fos_describe_df.index, hep_n_fos_describe_df['fos_cum_containment'])
ax[1].scatter(hep_n_fos_describe_df.index, hep_n_fos_describe_df['fos_cum_containment'])
ax[1].set_xlabel('Year')
ax[1].set_ylabel('Containment of Cumulative Topics')
ax[1].set_ylim((0,1))

plt.tight_layout()
plt.savefig('../reports/containment.png')
plt.show()

The quantity of new knowledge connections that can be made will be determined in part by the continued study of topics over time. The graph here shows the extent to which the topics in a given year contain the topics studied in the previous year. We know that the overall number of topics grows each year, and now we can also see that the publications in each new time period consistently capture 70% of the topics from the preceding year.

This tells us two things. First, there may be a consistent fraction of edges in one year that are either reinforcing edges or new edges between previously existing nodes. The other edges would be those formed between a pair of nodes that includes at least one new node. Second, due to the linear growth in the number of nodes, this fraction may be fairly constant, and provide an upper limit for the fraction of new edges that can be formed each year. In practice, this limit is unlikely to be reached as the number of possible edges far exceeds the number of publications, so we can think of it as contributing to a soft upper limit.
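The containment measure used above is not symmetric, which is easy to see on a small example. The topic sets below are invented for illustration; only the `containment` function mirrors the one defined in the notebook.

```python
def containment(a, b):
    """Fraction of the items in `a` that also appear in `b`."""
    return len(a & b) / len(a)

# toy topic sets (NOT real data)
topics_prev_year = {'gauge theory', 'quark', 'supersymmetry',
                    'lattice qcd', 'neutrino'}
topics_this_year = {'gauge theory', 'quark', 'supersymmetry',
                    'higgs boson', 'dark matter', 'axion'}

# 3 of the 5 previous topics recur this year
print(containment(topics_prev_year, topics_this_year))  # 0.6
# 3 of the 6 current topics were already present last year
print(containment(topics_this_year, topics_prev_year))  # 0.5
```

The plots use the first ordering: the share of the earlier topics that are revisited, not the share of the current topics that are old.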

#### Share of Edge Types

In [None]:
fig, ax = plt.subplots(ncols=3, figsize=(15, 3.5))

ax[0].plot(eprops_mean_df.set_index('year')['new_co'][1:], label='New')
ax[0].plot(eprops_mean_df.set_index('year')['reinforcing_co'][1:], label='Reinforcing')
ax[0].plot(eprops_mean_df.set_index('year')['inactive_co'][1:], label='Inactive')
ax[0].set_ylabel('% Edge Type')
ax[0].set_xlabel('Year')
ax[1].plot(eprops_sum_df.set_index('year')['new_co'][1:], label='New')
ax[1].plot(eprops_sum_df.set_index('year')['reinforcing_co'][1:], label='Reinforcing')
ax[1].plot(eprops_sum_df.set_index('year')['inactive_co'][1:], label='Inactive')
ax[1].set_ylabel('Cooccurrence Count')
ax[1].set_xlabel('Year')

ax[2].plot(eprops_sum_df.set_index('year')['new_co'][1:], label='New')
ax[2].plot(eprops_sum_df.set_index('year')['reinforcing_co'][1:], label='Reinforcing')
ax[2].set_ylabel('Cooccurrence Count')
ax[2].set_xlabel('Year')

ax[0].set_ylim((0,1))
ax[0].legend()
ax[1].legend()
ax[2].legend()
plt.tight_layout()
plt.savefig('../reports/share_of_edge_types.png', dpi=300)
plt.show()

Here we plot the number of new, reinforcing and inactive edges. Any edge must fall into one of these categories, and they are defined as follows:

- **New**: an edge that connects two previously unconnected nodes.
- **Reinforcing**: an edge that was formed in a previous period and is covered by at least one publication in the current period.
- **Inactive**: an edge that was formed in a previous period and has not been covered by any publication in the current period.
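The three definitions above can be expressed with plain set operations. This is a minimal sketch on invented papers, not the implementation used by `cumulative_cooccurrence_graph`; `edge_set` and `classify_edges` are hypothetical helpers.

```python
from itertools import combinations

def edge_set(papers):
    """All unordered topic pairs that cooccur in at least one paper."""
    return {pair
            for topics in papers
            for pair in combinations(sorted(set(topics)), 2)}

def classify_edges(previous_papers, current_papers):
    """Split the cumulative edge set into new, reinforcing and inactive edges."""
    seen_before = edge_set(previous_papers)
    seen_now = edge_set(current_papers)
    return {
        'new': seen_now - seen_before,
        'reinforcing': seen_now & seen_before,
        'inactive': seen_before - seen_now,
    }

# toy papers (NOT real data), each a list of topic labels
previous_papers = [['a', 'b', 'c'], ['c', 'd']]
current_papers = [['a', 'b'], ['b', 'd']]

edge_types = classify_edges(previous_papers, current_papers)
for name, edges in edge_types.items():
    print(name, sorted(edges))
```

Because the three sets are disjoint and together cover every edge seen so far, their shares always sum to one, which is why the first panel can be read as a breakdown of the whole graph.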

Over time, we can see that the number of inactive edges follows the same trend as the total number of edges in our knowledge graph. Inactive edges also dominate the graph in all years after the second year in our timespan, with their share growing from around 60% to over 90% between 1995 and 2018. In contrast, the proportion of new and reinforcing edges decreases as time goes on. As we might expect, reinforcing edges are more common than new edges, implying that a greater proportion of work done each year is at the intersection of two topics that were connected by previous researchers.

What the proportions do not tell us is whether the overall number of new and reinforcing edges is actually growing over time. The absolute numbers show an interesting story. The number of inactive edges in each year marches steadily higher with a high rate of linear growth. Reinforcing edges also grow roughly linearly, though at a much lower volume and at a rate slower than inactive edges. However, new edges show no growth at all. The same number are created each year despite the fact that the numbers of publications and nodes both grow.

Why is this? Perhaps because, as we saw before, new topics become rarer with time, or perhaps for another reason. Despite the growth in the number of topics, does it become harder to find completely new knowledge connections? Is the number of new edges driven by the number of new nodes, or by new connections between existing nodes? We should break the new edges down into further sub-categories to understand their dynamics.

We can also further investigate the dynamics of reinforcing and inactive edges. Are reinforcing edges always in the same places? How old is the average inactive edge? How are the cooccurrences that form new and reinforcing edges distributed (i.e. are they concentrated on a small number of edges)?

#### Filtering Edges by Node Type

Let's add another property to our nodes to signal whether they have been added that year or existed from a previous year.

In [None]:
for year, co in co_graphs.items():
    new_topic = co.new_vertex_property('bool')
    co.vp['new_topic'] = new_topic
    prev_year = year - 1
    if prev_year in co_graphs:
        prev_num_topics = co_graphs[prev_year].num_vertices()
        for v in co.vertices():
            # indices at or beyond last year's vertex count belong to new topics
            co.vp['new_topic'][v] = int(v) >= prev_num_topics
    else:
        co.vp['new_topic'].a = [True for _ in range(co.num_vertices())]

And now let's add a new edge property that tells us whether our edges connect new nodes, existing nodes, or a combination.

In [None]:
for year, co in co_graphs.items():
    
    new_linkage = co.new_edge_property('bool')
    mixed_linkage = co.new_edge_property('bool')
    old_linkage = co.new_edge_property('bool')
    
    for e in co.edges():
        s_new = bool(co.vp['new_topic'][e.source()])
        t_new = bool(co.vp['new_topic'][e.target()])
        # only edges formed this year are sub-categorised; all others stay False
        is_new_edge = bool(co.ep['new_co'][e])
        # both endpoints are new topics
        new_linkage[e] = is_new_edge and s_new and t_new
        # exactly one endpoint is a new topic
        mixed_linkage[e] = is_new_edge and (s_new != t_new)
        # both endpoints already existed
        old_linkage[e] = is_new_edge and not (s_new or t_new)
            
    co.ep['new_linkage'] = new_linkage
    co.ep['mixed_linkage'] = mixed_linkage
    co.ep['old_linkage'] = old_linkage

In [None]:
eprops_new_linkage_mean_df = agg_props_to_df(co_graphs, edge_filter='new_linkage')
eprops_mix_linkage_mean_df = agg_props_to_df(co_graphs, edge_filter='mixed_linkage')
eprops_old_linkage_mean_df = agg_props_to_df(co_graphs, edge_filter='old_linkage')

In [None]:
fig, ax = plt.subplots()
ax.plot(eprops_new_linkage_mean_df['year'].values[1:], eprops_new_linkage_mean_df['num_edges'].values[1:],
     label='New Topics')
ax.plot(eprops_new_linkage_mean_df['year'].values[1:], eprops_mix_linkage_mean_df['num_edges'].values[1:],
    label='Mixed Topics')
ax.plot(eprops_new_linkage_mean_df['year'].values[1:], eprops_old_linkage_mean_df['num_edges'].values[1:],
    label='Existing Topics')
ax.set_xlabel('Year')
ax.set_ylabel('N Edges')
ax.legend()

plt.tight_layout()
plt.savefig('../reports/new_edge_types.png', dpi=300)
plt.show()

Here we can see the breakdown of new edges into three further sub-categories: those formed between two new topics, those formed between a new topic and an existing topic, and those formed between two existing topics that were not yet connected. The first two types show a slow decline in absolute numbers over time, although there are consistently more mixed edges than edges between two new topics. New edges between existing topics stay at a relatively constant level throughout the timespan. The relative likelihoods of the three categories are in line with what we might expect: it is easier to create links that involve topics which already exist within the network. Another conclusion is that new topics are usually introduced to the knowledge graph by papers that combine them with a topic that is already known. Papers whose topics are all new are likely to be very rare.
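The last two claims concern papers rather than edges, so they could be checked directly at the paper level. The following is a hedged sketch on invented data; `paper_novelty` is a hypothetical helper and the topic names are placeholders.

```python
from collections import Counter

def paper_novelty(paper_topics, known_topics):
    """Label a paper by how many of its topics are new to the graph."""
    new = [t for t in paper_topics if t not in known_topics]
    if len(new) == len(paper_topics):
        return 'all new'
    if new:
        return 'mixed'
    return 'all existing'

# toy data (NOT real): topics seen in earlier years, and this year's papers
known_topics = {'quark', 'gluon', 'lattice qcd'}
papers = [
    ['quark', 'gluon'],
    ['quark', 'pentaquark'],
    ['lattice qcd', 'gluon', 'tetraquark'],
    ['axion', 'dark photon'],
]

counts = Counter(paper_novelty(p, known_topics) for p in papers)
print(counts)
```

Applied year by year, with `known_topics` built from all preceding years, the share of 'all new' papers would show directly how rare such publications are.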

Is this decline just a product of the fact that most topics are mentioned on a sub-annual basis? In this case, the number of new topics will be higher at periods near the start of the timespan even if they are active areas of study.

We should investigate this and perhaps use age instead of the binary categorisation between new and not new.

#### Vertex Age and Frequency

In [None]:
for year, co in co_graphs.items():
    age = co.new_vertex_property('int')
    appearances = co.new_vertex_property('int')
    frequency = co.new_vertex_property('double')
    # previous year's graph; None for the first year, where every topic is new
    co_prev = co_graphs.get(year - 1)
    # age and appearances
    for v in co.vertices():
        if co.vp['new_topic'][v]:
            age[v] = 1
            appearances[v] = 1
        else:
            age[v] = int(co_prev.vp['age'][int(v)]) + 1
            if co.vp['o'][v] == 0:
                appearances[v] = int(co_prev.vp['appearances'][int(v)])
            else:
                appearances[v] = int(co_prev.vp['appearances'][int(v)]) + 1

    frequency.a = appearances.get_array() / age.get_array()
    
    co.vp['age'] = age
    co.vp['appearances'] = appearances
    co.vp['frequency'] = frequency 

In [None]:
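# recompute the vertex aggregates so that they include the new properties
vprops_mean_df = agg_props_to_df(co_graphs, prop_type='v')
fig, ax = plt.subplots(ncols=3, figsize=(15, 3.5))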
ax[0].plot(vprops_mean_df['year'], vprops_mean_df['age'])
ax[1].plot(vprops_mean_df['year'], vprops_mean_df['appearances'])
ax[2].plot(vprops_mean_df['year'], vprops_mean_df['frequency'])
ax[0].scatter(vprops_mean_df['year'], vprops_mean_df['age'])
ax[1].scatter(vprops_mean_df['year'], vprops_mean_df['appearances'])
ax[2].scatter(vprops_mean_df['year'], vprops_mean_df['frequency'])
for a in ax:
    a.set_xlabel('Year')
ax[0].set_ylabel('Mean Topic Age')
ax[1].set_ylabel('Mean Topic Appearances')
ax[2].set_ylabel('Mean Topic Frequency')
plt.show()

We now define three new properties of a topic:
- **Age**: the number of years since it was first introduced to the knowledge network
- **Appearances**: the number of years that it has been mentioned in a publication
- **Frequency**: appearances / age
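These definitions can be illustrated without a graph at all. The sketch below computes the three properties from a mapping of year to the set of topics mentioned that year. The data is invented, `topic_stats` is a hypothetical helper, and it assumes consecutive years with a topic's age counted as 1 in the year it first appears, as in the cell above.

```python
def topic_stats(topics_by_year):
    """Age, appearances and frequency of every topic at the final year.

    `topics_by_year` maps year -> set of topics mentioned that year.
    """
    years = sorted(topics_by_year)
    last_year = years[-1]
    first_seen = {}
    appearances = {}
    for year in years:
        for topic in topics_by_year[year]:
            first_seen.setdefault(topic, year)
            appearances[topic] = appearances.get(topic, 0) + 1
    stats = {}
    for topic, first in first_seen.items():
        age = last_year - first + 1  # a topic introduced this year has age 1
        stats[topic] = {
            'age': age,
            'appearances': appearances[topic],
            'frequency': appearances[topic] / age,
        }
    return stats

# toy data (NOT real)
topics_by_year = {
    2000: {'a', 'b'},
    2001: {'a'},
    2002: {'a', 'c'},
    2003: {'b', 'c'},
}
stats = topic_stats(topics_by_year)
for topic in sorted(stats):
    print(topic, stats[topic])
```

A frequency of 1 means a topic has been mentioned every year since it was introduced; lower values mean it has had dormant years.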

Here we plot the three properties. The mean age of topics increases linearly over time, rising by 0.67 years for each year in the timespan. The rise is less than one year per year because new nodes are added to the network each year.

The number of appearances shows a different trend, tapering off slightly. This could mean that topics are revisited less frequently over time.

This appears to be the case from the third plot, which shows a continuing decrease in frequency over the timespan available.

#### Node and Edge DataFrames

We can also explore properties at the individual node and edge level. This is important if we want to calculate a K Score at the publication level.

In [None]:
v_co_dfs = []
e_co_dfs = []
for year, co in co_graphs.items():
    v_co_df = vertices_2_dataframe(co)
    v_co_df['year'] = year
    v_co_dfs.append(v_co_df)
    
    e_co_df = edges_2_dataframe(co)
    e_co_df['year'] = year
    e_co_dfs.append(e_co_df)
    
v_co_df = pd.concat(v_co_dfs)
e_co_df = pd.concat(e_co_dfs)

In [None]:
mean_k_scores = []
for year, group in df_hep.groupby('year_created'):
    co = co_graphs[year]
    for ids in group['fos_d_ids']:
        k_scores = []
        for s, t in combinations(sorted(ids), 2):
            # look the edge up once; property maps are indexed by edge descriptor
            e = co.edge(s, t)
            if e is not None:
                k_scores.append(co.ep['k_score'][e])
        # publications with no matching edges get NaN
        mean_k_scores.append(np.mean(k_scores) if k_scores else np.nan)
        
df_hep['k_score'] = mean_k_scores

In [None]:
fig, ax = plt.subplots()
ax.plot(df_hep.groupby('year_created')['k_score'].mean())
ax.set_xlabel('Year')
ax.set_ylabel('Mean Publication K Score')
plt.tight_layout()
plt.savefig('../reports/mean_annual_k_score.png', dpi=300)
plt.show()

For each publication, we query the edges of the graph at the relevant year that correspond to the pairwise combinations of its topics. For each combination we take the K Score, and then average over all of them. Papers with many new edges will have a higher K Score than those with a high proportion of reinforcing edges. We can see that over time, the average score decreases and then settles around 0.3. Interestingly, this is at odds with the calculations from the WR Report data that shows a continuing decrease over a similar timespan.
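The averaging step can be sketched independently of graph-tool by storing edge scores in a dictionary keyed by sorted vertex pairs. The scores below are invented and `publication_k_score` is a hypothetical helper; it only illustrates the averaging, not how the edge-level K Score itself is calculated.

```python
from itertools import combinations
from statistics import mean

def publication_k_score(topic_ids, edge_k_scores):
    """Mean K Score over the topic pairs of one publication.

    `edge_k_scores` maps a sorted (source, target) pair to that edge's score.
    Returns None when none of the pairs is an edge in the graph.
    """
    pairs = combinations(sorted(set(topic_ids)), 2)
    scores = [edge_k_scores[p] for p in pairs if p in edge_k_scores]
    return mean(scores) if scores else None

# toy edge scores (NOT real data)
edge_k_scores = {(0, 1): 1.0, (0, 2): 0.5, (1, 2): 0.25, (2, 3): 0.1}

three_topic_paper = publication_k_score([2, 0, 1], edge_k_scores)
unmatched_paper = publication_k_score([0, 3], edge_k_scores)
print(three_topic_paper, unmatched_paper)
```

Sorting the topic ids before forming pairs keeps the lookup keys consistent regardless of the order in which a publication lists its topics.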

In [None]:
fig, ax = plt.subplots(figsize=(12, 3.5))
sns.stripplot(x=df_hep['year_created'], y=df_hep['k_score'], color='C0', alpha=0.02, jitter=0.4, ax=ax)
plt.setp(ax.get_xticklabels(), rotation=90)
ax.set_xlabel('Year')
ax.set_ylabel('K Score')
plt.show()

If we try to plot the K Score for every publication over time, we can see that there are some common modes at 1, 0.5 and 0.33. This is likely due to the fact that there are so many edges which are visited only a handful of times.

In [None]:
fig, ax = plt.subplots(ncols=2, figsize=(10, 3.5))

n_fos_agg = df_hep[df_hep['year_created'] > 2000]
n_fos_agg_mean = n_fos_agg.groupby('n_fos')['k_score'].mean()
ax[0].scatter(n_fos_agg_mean.index, n_fos_agg_mean.values)
ax[0].set_xlabel('N Fields of Study')
ax[0].set_ylabel('Mean K Score')

sns.violinplot(x=n_fos_agg['n_fos'], y=n_fos_agg['k_score'], color='C0', ax=ax[1])
ax[1].set_xlabel('N Fields of Study')
ax[1].set_ylabel('K Score')

plt.tight_layout()
plt.savefig('../reports/n_fos_k_score.png', dpi=300)
plt.show()

Finally, we look at the trend of mean K Score by the number of fields of study contained within a paper. We do this for publications from 2001 onwards, as this is the stable period with respect to K Score during our timespan. We can see a relatively flat trend between publications that have only 3 FoS and those which have 8.

Looking at the distributions in K Score for publications with different numbers of FoS, we can see a very interesting trend. As the number of fields of study increases, the proportion of publications with a very high or very low K Score decreases. This tells us that the K Score converges as more topics are covered. However, we're not sure whether this is because averaging over a higher number of topics has a balancing effect, or whether publications that cover a larger number of topics are inherently less novel. It is also unclear whether this is a real effect, or an artefact of the taxonomy and Microsoft Academic Graph's labelling algorithm.
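The size of a pure averaging effect can be gauged with a simple null model: give every edge an independent random score and see how the spread of the per-paper mean changes with the number of topics. The sketch below assumes uniform, independent edge scores, which the real K Scores certainly are not, so it only indicates the direction of the effect. `simulate_mean_scores` is a hypothetical helper.

```python
import random
from statistics import mean, pstdev

def simulate_mean_scores(n_topics, n_papers=2000, seed=0):
    """Mean of independent uniform edge scores for papers with n_topics topics."""
    rng = random.Random(seed)
    n_edges = n_topics * (n_topics - 1) // 2  # pairwise topic combinations
    return [mean(rng.random() for _ in range(n_edges))
            for _ in range(n_papers)]

# spread of the per-paper mean score for papers with 3 to 8 topics
spread = {n: pstdev(simulate_mean_scores(n)) for n in range(3, 9)}
for n, sd in spread.items():
    print(f'{n} topics ({n * (n - 1) // 2} edges): std of mean = {sd:.3f}')
```

Under this null model the spread shrinks as the number of topics grows, purely because more edge scores are averaged. Comparing the observed narrowing against such a baseline is one way to judge whether papers with many topics are less novel beyond what averaging alone would produce.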

**To Do**
- Is a new edge predictive of increased activity in that area above what we might expect among the population?
- Or is an edge connecting a new node predictive of a higher publication activity in that area?
- Does connecting to a new node predict that other nodes within the vicinity of the edge are more likely to connect next year?
- Create edge property for growing edges (or can this already be deduced from existing edge props?)
- Find node overlap each year
- Understand the difference in the publication level K Score trend between ArXiv and WR data.
- Look at the breakdown in K Score by edge for publications with different numbers of FoS. Does the breadth of topics covered by a publication actually impact the distribution of 

**Some caveats so far**
- Decrease in average number of labels per paper could be that newer topics don't yet have a defined label.
- Might need to normalise by number of papers in each year

**Improvements**
- Calculate age of edge using groupby

In [None]:
df_hep.to_csv('../data/processed/hep_arxiv_publications.csv', index=False)

In [None]:
v_co_df.to_csv('../data/processed/hep_co_vertices.csv', index=False)
e_co_df.to_csv('../data/processed/hep_co_edges.csv', index=False)