### Network growth

1. Search the section headings for "History".
2. Find a list of years in that section.
3. Use the first one as the birth year of the idea.
4. If there is no "History" section, look backwards in the graph to a page that has one, and use that page's year.
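
Steps 1 to 3 can be sketched roughly as follows. This is a minimal illustration on raw wikitext, not the `wiki` module's actual implementation: the helper names `history_section` and `first_year`, the heading regex, and the assumption that a year is any four-digit number from 1000 to 2099 are all hypothetical. Step 4 (falling back to a page earlier in the graph) is left to the caller, which sees `None` when no "History" section is found.

```python
import re

# Hypothetical patterns, chosen for illustration only.
HEADING = re.compile(r'^(=+)[ \t]*(.*?)[ \t]*\1[ \t]*$', re.MULTILINE)
YEAR = re.compile(r'\b(1\d{3}|20\d{2})\b')  # assumption: years 1000-2099


def history_section(wikitext):
    """Return the text under the first heading containing 'History', or None."""
    headings = list(HEADING.finditer(wikitext))
    for i, heading in enumerate(headings):
        if 'history' in heading.group(2).lower():
            level = len(heading.group(1))
            end = len(wikitext)
            # The section runs until the next heading of the same or higher level.
            for nxt in headings[i + 1:]:
                if len(nxt.group(1)) <= level:
                    end = nxt.start()
                    break
            return wikitext[heading.end():end]
    return None


def first_year(wikitext):
    """Return the first year mentioned in the 'History' section, or None."""
    section = history_section(wikitext)
    if section is None:
        return None
    match = YEAR.search(section)
    return int(match.group(1)) if match else None


sample = """Intro mentions 1999.
== History ==
The idea dates to 1921 and was revised in 1950.
=== Early work ===
More in 1930.
== Applications ==
Used since 2001.
"""
print(first_year(sample))
```

Years outside the "History" section (such as the one in the intro of `sample`) are ignored, which is the point of restricting the search to that section.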

#### Visualization
* Use ~~[bokeh](https://bokeh.pydata.org/en/latest/docs/user_guide/graph.html)~~ ~~[webweb](https://webwebpage.github.io/)~~ [d3](https://observablehq.com/@d3/force-directed-graph).
* Use a slider to show growth in the network.
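
Whatever front end ends up drawing the graph, a slider needs a snapshot of the network at each year. A minimal sketch of that step, assuming every node carries a `year` attribute that is either an integer or an empty value for undated nodes; the helper name `graph_at_year` and the toy graph are hypothetical:

```python
import networkx as nx


def graph_at_year(graph, year):
    """Return the subgraph of nodes dated on or before `year`.

    Undated nodes (anything that is not an int) are left out.
    """
    keep = [n for n, y in graph.nodes(data='year')
            if isinstance(y, int) and y <= year]
    return graph.subgraph(keep)


# Toy graph for illustration only.
g = nx.DiGraph()
g.add_node('A', year=1900)
g.add_node('B', year=1950)
g.add_node('C', year=[])  # undated
g.add_edges_from([('A', 'B'), ('B', 'C')])

# One snapshot per slider position.
for t in (1900, 1950):
    snapshot = graph_at_year(g, t)
    print(t, snapshot.number_of_nodes(), snapshot.number_of_edges())
```

`subgraph` returns a view rather than a copy, so the snapshots stay cheap; call `.copy()` on the result if a snapshot needs to be modified or exported.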

### Label with years

In [14]:
import networkx as nx

path_saved = '/Users/harangju/Box Sync/Research/my papers/wikipedia paper/data/graphs/full_page/undated/'
# natural & physical sciences
topics = ['anatomy', 'biochemistry', 'cognitive science', 'evolutionary biology',
          'genetics', 'immunology', 'molecular biology']
topics += ['chemistry', 'biophysics', 'energy', 'optics', 
           'earth science', 'geology', 'meteorology']
# philosophy
# topics += []
topics += ['philosophy of language', 'philosophy of law', 
           'philosophy of mind', 'philosophy of science']
# social sciences
topics += ['economics', 'accounting', 'education', 'linguistics', 'law', 'psychology', 'sociology']
# technology & applied sciences
topics += ['electronics', 'software engineering', 'robotics']
topics += ['physics']

graphs = {}
for topic in topics:
    graphs[topic] = nx.read_gexf(path_saved + topic + '.gexf')
graphs

{'anatomy': <networkx.classes.digraph.DiGraph at 0x7fda814b6b38>,
 'biochemistry': <networkx.classes.digraph.DiGraph at 0x7fda9227d4e0>,
 'cognitive science': <networkx.classes.digraph.DiGraph at 0x7fda924230b8>,
 'evolutionary biology': <networkx.classes.digraph.DiGraph at 0x7fda82535390>,
 'genetics': <networkx.classes.digraph.DiGraph at 0x7fda922fc390>,
 'immunology': <networkx.classes.digraph.DiGraph at 0x7fda824b9438>,
 'molecular biology': <networkx.classes.digraph.DiGraph at 0x7fda81561588>,
 'chemistry': <networkx.classes.digraph.DiGraph at 0x7fda824b9128>,
 'biophysics': <networkx.classes.digraph.DiGraph at 0x7fda4cbc0898>,
 'energy': <networkx.classes.digraph.DiGraph at 0x7fda4c5e02b0>,
 'optics': <networkx.classes.digraph.DiGraph at 0x7fda9237b128>,
 'earth science': <networkx.classes.digraph.DiGraph at 0x7fdab1a40d30>,
 'geology': <networkx.classes.digraph.DiGraph at 0x7fda8234eb70>,
 'meteorology': <networkx.classes.digraph.DiGraph at 0x7fda825a6e80>,
 'philosophy of langu

In [15]:
import wiki

path_base = '/Users/harangju/Developer/data/wiki/'
name_xml = 'enwiki-20190801-pages-articles-multistream.xml.bz2'
name_index = 'enwiki-20190801-pages-articles-multistream-index.txt.bz2'
path_xml = path_base + name_xml
path_index = path_base + name_index
dump = wiki.Dump(path_xml, path_index)

In [16]:
for topic, graph in graphs.items():
    print('Topic: ' + topic)
    for node in graph.nodes:
        dump.load_page(node)
        graph.nodes[node]['year'] = dump.years[0] if len(dump.years)>0 else []

Topic: anatomy
Dump: Loading index...
Dump: Loaded.
Topic: biochemistry
Topic: cognitive science
Topic: evolutionary biology
Topic: genetics
Topic: immunology
Topic: molecular biology
Topic: chemistry
Topic: biophysics
Topic: energy
Topic: optics
Topic: earth science
Topic: geology
Topic: meteorology
Topic: philosophy of language
Topic: philosophy of law
Topic: philosophy of mind
Topic: philosophy of science
Topic: economics
Topic: accounting
Topic: education
Topic: linguistics
Topic: law
Topic: psychology
Topic: sociology
Topic: electronics
Topic: software engineering
Topic: robotics
Topic: physics


In [18]:
path_dated_unfilled = '/Users/harangju/Box Sync/Research/my papers/wikipedia paper' +\
                      '/data/graphs/full_page/dated/unfilled/'
for topic, graph in graphs.items():
    nx.write_gexf(graph, path_dated_unfilled + topic + '.gexf')

In [19]:
graphs['robotics'].nodes['Robotics']

{'label': 'Robotics', 'year': 1900}

#### Fill in nodes without years

In [20]:
def fill_empty_nodes(graph, full_parents=True):
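    """
    Fill at most one undated node with the latest year among its predecessors.

    Parameters
    ----------
    graph : networkx.DiGraph
        graph whose nodes carry a 'year' attribute ([] if undated)
    full_parents : bool
        if True, fill a node only when all of its predecessors are dated;
        if False, ignore undated predecessors

    Returns
    -------
    bool
        whether an empty node was filled (at most one per call)
    """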
    empty_nodes = [n for n in graph.nodes if not graph.nodes[n]['year']]
    for node in empty_nodes:
        years = [graph.nodes[p]['year'] for p in graph.predecessors(node)]
        if not years:
            continue
        if full_parents:
            if [] not in years:
                graph.nodes[node]['year'] = max(years)
                return True
        else:
            years_filtered = [y for y in years if y]
            if years_filtered:
                graph.nodes[node]['year'] = max(years_filtered)
                return True
    return False

In [21]:
for topic, graph in graphs.items():
    print('Topic:', topic)
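    # First pass: fill nodes whose predecessors are all dated,
    # one node per call, until nothing more can be filled.
    nodes_filled = True
    while nodes_filled:
        nodes_filled = fill_empty_nodes(graph, full_parents=True)
    # Second pass: fill the remaining nodes from whichever predecessors are dated.
    nodes_filled = True
    while nodes_filled:
        nodes_filled = fill_empty_nodes(graph, full_parents=False)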

Topic: anatomy
Topic: biochemistry
Topic: cognitive science
Topic: evolutionary biology
Topic: genetics
Topic: immunology
Topic: molecular biology
Topic: chemistry
Topic: biophysics
Topic: energy
Topic: optics
Topic: earth science
Topic: geology
Topic: meteorology
Topic: philosophy of language
Topic: philosophy of law
Topic: philosophy of mind
Topic: philosophy of science
Topic: economics
Topic: accounting
Topic: education
Topic: linguistics
Topic: law
Topic: psychology
Topic: sociology
Topic: electronics
Topic: software engineering
Topic: robotics
Topic: physics


In [23]:
path_dated_filled = '/Users/harangju/Box Sync/Research/my papers/wikipedia paper' +\
                    '/data/graphs/full_page/dated/filled/'
for topic, graph in graphs.items():
    nx.write_gexf(graph, path_dated_filled + topic + '.gexf')

#### Load graphs

In [None]:
topics = ['biochemistry']
path_dated_filled = '/Users/harangju/Box Sync/Research/my papers/wikipedia paper' +\
                    '/data/graphs/full_page/dated/filled/'

In [None]:
import networkx as nx

graphs = {}
for topic in topics:
    graphs[topic] = nx.read_gexf(path_dated_filled + topic + '.gexf')
graphs

### Visualize growth

* Gephi [tutorial](https://seinecle.github.io/gephi-tutorials/generated-html/converting-a-network-with-dates-into-dynamic.html) on converting a network with dates into a dynamic one

### Clique community growth

In [None]:
import networkx as nx

In [None]:
# cliques = nx.enumerate_all_cliques(nx.Graph(graphs['biochemistry']))
cliques = nx.find_cliques(nx.Graph(graphs['biochemistry']))
list(cliques)[100:110]
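
One possible way to follow clique growth over time is to count maximal cliques in yearly snapshots of the dated graph. The sketch below is only an assumption about how that could be set up: the helper name `clique_counts_by_year`, the `min_size` threshold, and the toy graph are hypothetical, and nodes without an integer `year` are skipped.

```python
import networkx as nx


def clique_counts_by_year(graph, years, min_size=3):
    """Count maximal cliques of at least `min_size` nodes in each yearly snapshot."""
    undirected = nx.Graph(graph)  # find_cliques needs an undirected graph
    counts = {}
    for t in years:
        keep = [n for n, y in undirected.nodes(data='year')
                if isinstance(y, int) and y <= t]
        snapshot = undirected.subgraph(keep)
        counts[t] = sum(1 for clique in nx.find_cliques(snapshot)
                        if len(clique) >= min_size)
    return counts


# Toy graph for illustration only: a triangle A-B-C that completes in 1920,
# plus a later node D attached to C.
toy = nx.DiGraph()
toy.add_node('A', year=1900)
toy.add_node('B', year=1910)
toy.add_node('C', year=1920)
toy.add_node('D', year=1930)
toy.add_edges_from([('A', 'B'), ('B', 'C'), ('A', 'C'), ('C', 'D')])

print(clique_counts_by_year(toy, [1910, 1920, 1930]))
```

Recomputing cliques from scratch for every year is simple but can be slow on large graphs, since the number of maximal cliques can grow exponentially with graph size.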

### Persistent homology